Thursday, August 5, 2010

Adventures in performance tuning on Google App Engine

Over the last couple of months I've been dabbling with a home-spun API for the Madison Metro transit system. It is based on the work that I did building SMSMyBus, the mobile app that lets you find bus arrival estimates in real time using Twilio's SMS API. When I built that app, I simply scraped data from the Metro's website. 

But when I was finished, it immediately became clear that there were lots of other applications that needed to be built using this (and other) transit data. A good example, is the status monitor on display at the Mother Fool's coffee house. So I set out to build an API that would enable anyone to build new transit apps without having to implement the ugly screen scraping techniques. After documenting a draft of the API, I set off trying to implement the server.

Just as I did with SMSMyBus, I built the API server on Google App Engine. After grinding through the pains of scraping poorly formed data on the Metro site, I immediately ran into performance problems and started blowing through my quota. And I did this without even turning on all of the routes in the system. I was forced to actually study the GAE APIs in more detail.

This post is intended to share the experience of tuning the performance, how I measured bottlenecks, and what I did to fix them.

Problem definition

The heart of the API is the continual consumption of a fire hose of data at Madison Metro. This was accomplished using is a list of cron jobs that scrapes location information for every route in the city.

The prefetch handler parses a text file for an individual route that is read from a URL. Each entry is categorized by stop ID or vehicle ID and the arrival estimate and status models are created accordingly. In the case of the longer routes, this could be as many as 480 status entries.

Step one : Identifying the bottlenecks with the Quota API

The first version of the prefetch routine focused exclusively on just getting my hands on the useful information. I didn't pay any attention to the use of datastore calls. I just wanted to get the regular expressions right, create new model entities and shove it in the datastore.

This approach worked just fine on the local dev server, but I was consistently hitting DeadlineExceeded exceptions in the production environment. I needed to identify which elements of the prefetcher were expensive. Enter the Quota API...

    from google.appengine.api import quota

    start = quota.get_request_cpu_usage() 
    # do all the magic 
    end = quota.get_request_cpu_usage() 
    logging.warning("The magic took %s cycles" % (end - start)) 

Assuming GAE was behaving correctly, I started with my own code, the parsing of the fire hose. That did not provide fruitful so I started to add up all the time I was spending in API calls. Although individual calls were small, they quickly added up over hundreds of calls. It became obvious that this type of serial access to the datastore was a contributing factor.

Step two : Batch puts

Previously, the apps I had been building did not require bulk datastore updates so I glossed over that section of the API. But it's easy to overlook. The db.put() function supports a list as an argument. So rather than storing new entities like this, status.put(), I collected a list of model instances in the main loop and followed it with, db.put(statusList). 

This had a dramatic impact on the overall performance. The time it took to loop through the route file went from 12,000 megacycles to 2,000 megacycles (!).

Step three : Memcache tricks

Even after the improvement, I still noticed that I was spending just as much time parsing the file as I was storing results, and that didn't seem possible. After more quota probing, I discovered that I was fetching the same StopLocation entities repeatedly when I needed to find details about a particular stop.

However, when I started memcaching these entities I was disappointed by the overall performance improvement. And this led to perhaps the most revealing aspect of this entire process. The following memcache pattern is slow... 

    # loop over hundreds of entries in the fire hose 
    for e in firehose: 
        stopID = getStop(e) 
        stopEntity = memcache.get(stopID) 
        if stopEntity is None: 
            stopEntity = getItFromDatastore(stopID)
        # parse firehose entry and do other magic

In fact, very slow when you aggregate it over hundreds of accesses. Even if you get a 100% hit rate in the memcache.

The good news is that it led me down another useful API path for the memcache. Just like the batch puts for model entities, you can set and get lists of objects. Furthermore, you can get and set multiple key values with one call - set_multi() and get_multi(). 

This again led to dramatic performance improvements. The time it took to loop through the route file went from 2,000 megacycles (after the batch put optimization) to 600 megacycles. 

Step four : Install Appstats

I still wasn't happy with my performance so I went fishing for more resources and found Appstats. Truly a great resource, but it's something that should be used at the beginning of the process and not the end. :) It gives you a quick overview of how efficient you are (or aren't) being with the datastore. In my case, I had already optimized as best I could, but this tool is now in the front pocket for all future projects.

 Lessons Learned

  1. Become familiar with the quota package. It's the single best way to get granular measurements about where your app is spending its time. I'd link to the API, but I can't!?! It seems to be the only GAE API that isn't documented. The best resource I've seen is the monitoring section found on the platform quota page.

  2. Install Appstat event recorders in every GAE application you create. It's the quickest way learn - at a high level - how effective your memcaching strategy is.

  3. Familiarize yourself with the entire suite of functions in the APIs you are using. Even if you don't know the details of every call, it will help your engineering when you can recall every tool in your toolbox.

  4. Memcache is your friend. Your best friend actually. This might be stating the obvious, but stated nonetheless.

  5. Memcache access is still slow when you aggregate lots of accesses. Use batch processing - even for the memcache.

  6. The taskqueue is the best way to cheat the thirty second handler limit. I didn't talk a lot about this specifically, but if you chunk up problems into smaller ones, the task queue can help you overcome the inherit time restrictions within GAE.

The part I left out... even the best effort can't overcome a bad idea

Even after all of this analysis, tweaking and optimization, I wasn't quite at my goal. I've optimized the primary worker loop, cached all the data that wasn't changing, and optimized the expensive operation of storing new data. Things were better, but they were also more spread out. Instead lots of small costly operations, I had a handful of still costly operations. And operations that were now outside my control.

Running this algorithm for 80+ routes still wasn't feasible without considerable hiking of my billable quota. I've resolved myself to believing that although App Engine is a terrific tool for building web apps, it was never intended to build apps like this. I remain skeptical that they've resolved their datastore performance problems as they've stated they have.

It's also worth pointing out that there is a huge knowledge void when it comes to the Quota API. It is not documented well - both in terms of its use as well as an understanding of how to interpret the results.

I'm going back to the drawing board to determine if there is a better way to access and cache this transit data. Who knows, maybe Madison Metro can save all of us the trouble and just build an API for the existing data! Wouldn't that be groovy?