The AppEngine Bulk Loader

So, something I've spent a lot of time doing recently is waiting for the AppEngine Bulk Loader to finish. At the moment, there's nothing in the admin console to bulk add, edit or delete entries in your Datastore: you can create or edit entities singly, and you can tick off and delete up to 20 at a time, but that's your lot.

The officially-sanctioned bodge for this is the Bulk Loader. Though not well-documented, this is quite a powerful (if slow) tool.

The gist of it, as shown in the article I just linked to, is that you create a CSV file containing your data, then pass it to the bulkload_client.py script. For this to work, you need to set up a handler on the server that takes the CSV fields and turns them into a Datastore entity.
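
As a minimal sketch of the first step, the CSV can be produced with Python's csv module. The helper name albums_to_csv and the column layout (title, artist, year) are made up for illustration; the only real requirement is that the column order matches what your loader declares. The sketch writes no header row, on the assumption that every row is treated as data.

```python
import csv
import io

# Hypothetical album data: (title, artist, year).
ALBUMS = [
    ("Kind of Blue", "Miles Davis", 1959),
    ("Blue Train", "John Coltrane", 1957),
]

def albums_to_csv(rows):
    """Return the rows serialised as CSV text, one line per entity."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # The csv module quotes any field containing commas or quotes.
        writer.writerow(row)
    return buf.getvalue()

csv_text = albums_to_csv(ALBUMS)
```

Write csv_text out to a file and that file is what you hand to bulkload_client.py.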

To do this, your script needs to call bulkload.main() passing one object for each entity kind (db.Model) it supports. For instance, if you have Album and Track entities, you'll need something like:

python
boilerplate
if __name__ == "__main__":
    bulkload.main(AlbumLoader(), TrackLoader())

These are subclasses of bulkload.Loader, although in fact they don't need to be: you can just create instances of bulkload.Loader and pass the parameters directly into the constructor. The parameters are, firstly, the name of the entity kind it handles, and then a list of tuples, each one containing a property name and a type (a Python or Datastore type, not a property type: e.g. str, not db.StringProperty()).
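
To make the role of that list of tuples concrete, here's a rough stand-in that runs without the SDK. It is not the Bulk Loader's own code, just a sketch of the idea as I understand it: each CSV field is passed through the matching converter and stored under the matching property name. ALBUM_PROPERTIES and convert_row are hypothetical names.

```python
# Stand-in for the property list you would pass to bulkload.Loader.
ALBUM_PROPERTIES = [
    ("title", str),
    ("artist", str),
    ("year", int),
]

def convert_row(properties, fields):
    """Apply each column's converter to the matching CSV field."""
    if len(fields) != len(properties):
        raise ValueError(
            "expected %d fields, got %d" % (len(properties), len(fields)))
    return {
        name: convert(value)
        for (name, convert), value in zip(properties, fields)
    }

row = convert_row(ALBUM_PROPERTIES, ["Kind of Blue", "Miles Davis", "1959"])
```

The point is that the "type" is really just a callable, which is why you can later swap in your own conversion function.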

That's all, at least, in the simple case, and it's pretty straightforward.

Cookie Monster

The first undocumented problem you'll run into is security. If you have, sensibly, restricted your Bulk Loader to admin only, you'll find there's no way to supply your admin password to the bulkload_client.py script. Instead you need to supply a very, very long cookie. How to get this?

Log into your application in a web browser, as an admin, and go to the URL your handler lives at. When you GET rather than POST to it, it will tell you the right cookie to use.

Character Set of Doom

If you have UTF-8 data, the Bulk Loader in the SDK chokes. This is because the URL handler seems to decode the UTF-8 to Python's native unicode before passing it on to the loader, but BulkLoad.Load expects it to be still encoded. More precisely, Python's csv module expects a str, and given a unicode it will try to encode it using the ascii codec, which is almost certainly not what you want.
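
The underlying codec behaviour can be reproduced without App Engine at all. The snippet below only illustrates that behaviour and is not the SDK's code (ascii_safe is a made-up helper): text with a non-ASCII character can't go through the ascii codec, whereas encoding it as UTF-8 round-trips cleanly.

```python
text = u"caf\u00e9"  # unicode text containing one non-ASCII character

def ascii_safe(s):
    """Return True if s can be encoded with the ascii codec."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# Explicitly encoding as UTF-8 keeps the data intact.
encoded = text.encode("utf-8")
round_tripped = encoded.decode("utf-8")
```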

Hopefully they'll fix this soon, but a temporary work-around is to use your own subclass of bulkload.BulkLoad that re-encodes the data, something like:

python
Bulk Loader UTF-8 Workaround
class BulkLoad(bulkload.BulkLoad):
    def Load(self, kind, data):
        return bulkload.BulkLoad.Load(self, kind, data.encode('utf-8'))

To direct the loader at this version, you'll probably need to copy-and-paste bulkload.main into your own module and edit it — I haven't spotted a more convenient way yet. Still, hopefully this will all be unnecessary soon.

(Update 2008-07-15) Vincent Isambart reminds me to point out that you must tell the Bulk Loader you're using UTF-8 when you initialise it. I define a function to do the conversion:

python
utf-8 conversion
def utf8string(s):
    return unicode(s, 'utf-8')

...and then use this instead of str or unicode in the constructor:

python
utf-8 specified in the constructor
class UserLoader(bulkload.Loader):
    def __init__(self):
        bulkload.Loader.__init__(self, 'User', [
            ('name', utf8string),
        ])

The Mysterious HandleEntity

The article gives an example of using HandleEntity but doesn't really explain how to use it. It's a very useful optional extra. Essentially, it receives an entity definition, and can modify it in any way it likes before returning it. There are several interesting things to note here:

  • The entity is passed as a datastore.Entity, not a db.Model instance. This is just a Python dictionary subclass with some extra bits tacked on; read and write using normal Python dictionary['item'] notation if you need to modify the data before it gets written to the Datastore.
  • If you need to make one, you construct it by passing the model kind's name as a string, and can optionally supply a name (its new key name) and a parent (which must be a Key).
  • The main reason you might want to make one is that you can return any number of new entities from HandleEntity, so a single CSV row can spawn all sorts of data.
  • Perhaps more significantly, "any number" includes zero. So you can actually use the Bulk Loader as a Bulk Anythinger. If you return no entities, nothing is created, the Datastore is untouched, so the model kind you supplied right at the beginning is irrelevant. You can treat each loader as an arbitrary function which gets called repeatedly, with the CSV fields as parameters.

For instance, here's a Bulk Deleter, if you can get a list of keys into a CSV file:

python
Bulk Deleter
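from google.appengine.ext import bulkload
from google.appengine.ext import db

class BulkDeleter(bulkload.Loader):
    def __init__(self):
        # 'BulkDelete' is a dummy kind: nothing of this kind is ever stored.
        bulkload.Loader.__init__(self, 'BulkDelete', [
            ('key', str),
        ])

    def HandleEntity(self, entity):
        # Each CSV row holds one encoded key; look it up and delete it.
        key = entity['key']
        entity = db.get(db.Key(key))
        if entity:
            entity.delete()
        # Returning no entities means nothing is written to the Datastore.
        return []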
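
Going the other way, one row fanning out into several entities can be sketched without the SDK. Here plain dicts stand in for datastore.Entity, and both split_album_row and load_rows are hypothetical names; load_rows only mimics the idea of collecting whatever a HandleEntity-style function returns, it isn't how the Bulk Loader is actually written.

```python
def split_album_row(entity):
    """One CSV row in, several entity-like dicts out (or none at all)."""
    if not entity.get("title"):
        return []  # blank row: nothing gets stored
    album = {"kind": "Album", "title": entity["title"]}
    tracks = [
        {"kind": "Track", "album": entity["title"], "name": name.strip()}
        for name in entity.get("tracks", "").split(";")
        if name.strip()
    ]
    return [album] + tracks

def load_rows(rows, handle_entity):
    """Collect everything the handler returns for each row."""
    stored = []
    for row in rows:
        stored.extend(handle_entity(row))
    return stored

rows = [{"title": "A", "tracks": "x; y"}, {"title": ""}]
stored = load_rows(rows, split_album_row)
```

In this sketch the first row yields one album plus two tracks, and the second row yields nothing.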

Other Problems

This is all very well, but unfortunately it's slow and unreliable. Maybe I'm just unlucky but I've been seeing a lot of 502 Bad Gateway and 500 Server Error messages that interrupt the loader. One of the parameters to bulkload_client.py allows you to specify the number of CSV rows sent in each packet, and lowering this can help sometimes, but it seems a bit arbitrary to be honest. And each time, you have to start your bulk load from scratch.

Well, you did. I submitted a patch to the client script that allows you to skip a bunch of rows at the beginning of the file, so you can easily resume an interrupted bulk load. It also tells you how many rows it reached before failing, so you know where to resume from. You'll want to make sure your task is idempotent, as you can't always be sure precisely where the failure occurred. You'll want to check for the existence of objects before adding them, unless you're specifying their key exactly, in which case the new entity will simply overwrite the old one when you put() it (thanks to Nick Johnson for highlighting the put() behaviour here).
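
The skip-and-resume idea is simple enough to sketch in isolation. This is not the patch itself, and the names batches, batch_size and skip are my own rather than the client script's actual options; it just shows rows being grouped into packets, with the starting offset of each packet reported so you know where to pick up after a failure.

```python
import csv
import io
from itertools import islice

def batches(csv_text, batch_size, skip=0):
    """Yield (row_offset, rows) packets, ignoring the first `skip` rows."""
    reader = csv.reader(io.StringIO(csv_text))
    remaining = islice(reader, skip, None)
    offset = skip
    while True:
        batch = list(islice(remaining, batch_size))
        if not batch:
            return
        yield offset, batch
        offset += len(batch)

text = "a\nb\nc\nd\ne\n"
first_run = list(batches(text, 2))
resumed = list(batches(text, 2, skip=1))
```

If a packet fails, restarting with skip set to that packet's offset re-sends it in full, which is exactly why the handler needs to be idempotent.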

Hope it's of some use! Actually... no, wait. I hope it's useless, because that would mean the Bulk Loader wasn't falling over. :) Either way, hope you get your data loaded up without hassle!