Thursday, August 5, 2010

Adventures in performance tuning on Google App Engine

Over the last couple of months I've been dabbling with a home-spun API for the Madison Metro transit system. It is based on the work that I did building SMSMyBus, the mobile app that lets you find bus arrival estimates in real time using Twilio's SMS API. When I built that app, I simply scraped data from the Metro's website. 

But when I was finished, it immediately became clear that there were lots of other applications that needed to be built using this (and other) transit data. A good example is the status monitor on display at the Mother Fool's coffee house. So I set out to build an API that would let anyone build new transit apps without having to implement the ugly screen-scraping techniques. After documenting a draft of the API, I set off to implement the server.

Just as I did with SMSMyBus, I built the API server on Google App Engine. After grinding through the pains of scraping poorly formed data on the Metro site, I immediately ran into performance problems and started blowing through my quota. And I did this without even turning on all of the routes in the system. I was forced to actually study the GAE APIs in more detail.

This post is intended to share the experience of tuning the performance, how I measured bottlenecks, and what I did to fix them.

Problem definition

The heart of the API is the continual consumption of a fire hose of data from Madison Metro. This is accomplished with a list of cron jobs that scrape location information for every route in the city.

The prefetch handler parses a text file for an individual route that is read from a URL. Each entry is categorized by stop ID or vehicle ID and the arrival estimate and status models are created accordingly. In the case of the longer routes, this could be as many as 480 status entries.
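In sketch form, that categorization step might look like the following. The line format and field names here are purely illustrative assumptions; the post doesn't show Metro's actual feed format:

```python
import re

# Hypothetical line formats -- the real Metro feed's layout isn't shown
# in the post, so these patterns are stand-ins for the real regexes.
STOP_RE = re.compile(r'stop:(\d+)\s+route:(\w+)\s+eta:(\d+)min')
VEH_RE = re.compile(r'veh:(\d+)\s+route:(\w+)')

def parse_entry(line):
    """Categorize one fire-hose line by stop ID or vehicle ID."""
    m = STOP_RE.match(line)
    if m:
        return ('stop', {'stopID': m.group(1),
                         'routeID': m.group(2),
                         'minutes': int(m.group(3))})
    m = VEH_RE.match(line)
    if m:
        return ('vehicle', {'vehicleID': m.group(1),
                            'routeID': m.group(2)})
    return ('unknown', {})
```

Each categorized tuple then becomes an arrival-estimate or status model instance.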

Step one : Identifying the bottlenecks with the Quota API

The first version of the prefetch routine focused exclusively on just getting my hands on the useful information. I didn't pay any attention to my use of datastore calls. I just wanted to get the regular expressions right, create new model entities, and shove them into the datastore.

This approach worked just fine on the local dev server, but I was consistently hitting DeadlineExceeded exceptions in the production environment. I needed to identify which elements of the prefetcher were expensive. Enter the Quota API...

    from google.appengine.api import quota
    import logging

    start = quota.get_request_cpu_usage()
    # do all the magic
    end = quota.get_request_cpu_usage()
    logging.warning("The magic took %s megacycles" % (end - start))

Assuming GAE was behaving correctly, I started with my own code, the parsing of the fire hose. That didn't prove fruitful, so I started adding up all the time I was spending in API calls. Although each individual call was small, they quickly added up over hundreds of calls. It became obvious that this kind of serial access to the datastore was a major contributing factor.

Step two : Batch puts

Previously, the apps I had been building didn't require bulk datastore updates, so I had glossed over that section of the API. But it's easy to overlook: the db.put() function also accepts a list as an argument. So rather than storing each new entity individually with status.put(), I collected a list of model instances in the main loop and followed it with a single db.put(statusList).
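The pattern can be sketched like this. Since db.put() only runs inside App Engine, `put_fn` here is a stand-in for it, and the batch size guards against the datastore's per-call entity limit (500 per batch, if memory serves):

```python
def batch_put(entities, put_fn, batch_size=500):
    """Store model instances in large batches instead of one RPC each.

    put_fn stands in for google.appengine.ext.db.put, which accepts
    a list of entities; batch_size keeps each call under the
    datastore's per-call limit.
    """
    for i in range(0, len(entities), batch_size):
        put_fn(entities[i:i + batch_size])
```

Collecting entities in a list during the main loop and flushing them in one call turns hundreds of round-trips into a handful.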

This had a dramatic impact on the overall performance. The time it took to loop through the route file went from 12,000 megacycles to 2,000 megacycles (!).

Step three : Memcache tricks

Even after the improvement, I still noticed that I was spending just as much time parsing the file as I was storing results, and that didn't seem possible. After more quota probing, I discovered that I was fetching the same StopLocation entities repeatedly when I needed to find details about a particular stop.

However, when I started memcaching these entities I was disappointed by the overall performance improvement. And this led to perhaps the most revealing aspect of this entire process. The following memcache pattern is slow... 

    # loop over hundreds of entries in the fire hose
    for e in firehose:
        stopID = getStop(e)
        stopEntity = memcache.get(stopID)
        if stopEntity is None:
            stopEntity = getItFromDatastore(stopID)
            memcache.set(stopID, stopEntity)  # warm the cache for next time
        # parse firehose entry and do other magic

In fact, it's very slow when you aggregate it over hundreds of accesses, even with a 100% hit rate in the memcache.

The good news is that it led me down another useful path in the memcache API. Just like the batch puts for model entities, the memcache can set and get lists of objects: get_multi() and set_multi() handle multiple key-value pairs in a single call.
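A sketch of the batched lookup follows. The `cache_get_multi`/`cache_set_multi` arguments mirror the shape of memcache's get_multi()/set_multi() (a dict of key-to-value in and out), so the pattern can be shown outside App Engine; `fetch_from_db` stands in for a batched datastore get:

```python
def lookup_stops(stop_ids, cache_get_multi, cache_set_multi, fetch_from_db):
    """Resolve many stop IDs with one cache round-trip instead of N.

    cache_get_multi / cache_set_multi mirror the signatures of
    google.appengine.api.memcache.get_multi / set_multi.
    """
    found = cache_get_multi(stop_ids)       # {id: entity} for cache hits
    missing = [s for s in stop_ids if s not in found]
    if missing:
        fetched = fetch_from_db(missing)    # one batched datastore fetch
        cache_set_multi(fetched)            # warm the cache in one call
        found.update(fetched)
    return found
```

One get_multi() at the top of the loop replaces hundreds of individual memcache.get() calls.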

This again led to dramatic performance improvements. The time it took to loop through the route file went from 2,000 megacycles (after the batch put optimization) to 600 megacycles. 

Step four : Install Appstats

I still wasn't happy with my performance so I went fishing for more resources and found Appstats. Truly a great resource, but it's something that should be used at the beginning of the process and not the end. :) It gives you a quick overview of how efficient you are (or aren't) being with the datastore. In my case, I had already optimized as best I could, but this tool is now in the front pocket for all future projects.
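For the Python runtime of that era, wiring up Appstats amounted to a tiny appengine_config.py (plus mapping /_ah/stats to the Appstats UI in app.yaml):

```python
# appengine_config.py -- App Engine imports this module automatically
# and calls the hook below to wrap every webapp request with the
# Appstats event recorder.
def webapp_add_wsgi_middleware(app):
    from google.appengine.ext.appstats import recording
    return recording.appstats_wsgi_middleware(app)
```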

Lessons Learned

  1. Become familiar with the quota package. It's the single best way to get granular measurements of where your app is spending its time. I'd link to the API, but I can't!?! It seems to be the only GAE API that isn't documented. The best resource I've seen is the monitoring section of the platform quota page.

  2. Install the Appstats event recorder in every GAE application you create. It's the quickest way to learn - at a high level - how effective your memcaching strategy is.

  3. Familiarize yourself with the entire suite of functions in the APIs you are using. Even if you don't know the details of every call, it will help your engineering when you can recall every tool in your toolbox.

  4. Memcache is your friend. Your best friend actually. This might be stating the obvious, but stated nonetheless.

  5. Memcache access is still slow when you aggregate lots of accesses. Use batch processing - even for the memcache.

  6. The task queue is the best way to cheat the thirty-second handler limit. I didn't talk a lot about this specifically, but if you chunk big problems up into smaller ones, the task queue can help you overcome the inherent time restrictions within GAE.
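The chunking idea in that last lesson can be sketched as follows. Because taskqueue.add() only runs inside App Engine, `enqueue` is a stand-in for something like `lambda params: taskqueue.add(url='/prefetch', params=params)`, and the chunk size is an arbitrary illustrative choice:

```python
def enqueue_route_chunks(route_ids, enqueue, chunk_size=10):
    """Split the full route list into small tasks so that each
    handler invocation finishes well inside the 30-second limit.

    enqueue stands in for google.appengine.api.taskqueue.add.
    """
    for i in range(0, len(route_ids), chunk_size):
        chunk = route_ids[i:i + chunk_size]
        enqueue({'routes': ','.join(str(r) for r in chunk)})
```

Each enqueued task then prefetches only its own handful of routes instead of all 80+.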

The part I left out... even the best effort can't overcome a bad idea

Even after all of this analysis, tweaking, and optimization, I wasn't quite at my goal. I had optimized the primary worker loop, cached all the data that wasn't changing, and optimized the expensive operation of storing new data. Things were better, but they were also more spread out. Instead of lots of small costly operations, I had a handful of still-costly operations, and operations that were now outside my control.

Running this algorithm for 80+ routes still wasn't feasible without considerably hiking my billable quota. I've resigned myself to believing that although App Engine is a terrific tool for building web apps, it was never intended for apps like this. I remain skeptical that they've resolved their datastore performance problems as they've claimed.

It's also worth pointing out that there is a huge knowledge void when it comes to the Quota API. It's not well documented, both in terms of how to use it and how to interpret its results.

I'm going back to the drawing board to determine if there is a better way to access and cache this transit data. Who knows, maybe Madison Metro can save all of us the trouble and just build an API for the existing data! Wouldn't that be groovy?


  1. Nice post! Are you using asynchronous URLFetches for fetching the data? While it won't decrease your CPU usage, it will be substantially faster (in wall clock time) than serial fetches.

  2. Thanks for the post. I'd like to provide a similar service in Washington DC, but the official posted times are way off.

  3. Another cache layer to consider is a cache in the request object. I have started caching a lot of things in the request object as my first-level cache, then memcache, then datastore. If you fetch the same entity more than once during a request, you can drop a lot of memcache gets from your overall latency/time/cycles.

  4. @nick unfortunately, asynchronous fetch calls won't help me in this case. The only thing the handler does is parse the content of the fetch. :( It may help with an alternate implementation, however. Can I make ten asynchronous fetch calls without using a callback architecture? @tom luckily, Madison Metro posts live data fed in directly from the buses themselves. They just didn't take the time to create XML for it, let alone an API.

  5. @tim caching in the request object is a good point. I do this as well - especially for task queue requests - but I failed to mention this in the post. But I need to experiment with passing even more data in. Thanks!

  6. They will help you, because the point of the asynchronous URLFetch is to make it possible to do multiple concurrent fetches - so you can have a single handler fetch multiple URLs and process them at once. You can do it without callbacks - just call .get_result() on each RPC in sequence.

  7. My point with the fetch issue is that there is currently only one fetch call in the handler, and nothing can happen until I get those results. There is one exception: I could make one of the batched memcache calls before it returns. However, I have another technique for getting my hands on the data that would benefit greatly from this approach. I'm going to explore that.

  8. Right, but with asynchronous URLFetch, you could process several URLs in the same handler.

  9. Awesome example on Google App Engine. I just found this site. This is diamond!