To normalise or denormalise, that is the question
As noted in earlier posts, Google’s App Engine (GAE) doesn’t use an RDBMS to power its persistence layer. In fact it’s one huge Bigtable that holds data for the ‘whole world’, and my part of it is probably 0.0000001% or less.
Because Bigtable is not an RDBMS, it doesn’t fit the normal RDBMS way of working. In particular it doesn’t like table joins and is particularly ‘troubled’ by the many-to-many relationships that are the ‘bread and butter’ of an RDBMS. This requires a different way of thinking about the data and, perhaps, a change of tooling to suit.
Java Persistence API
Up to now we’ve been using a JPA framework on top of Bigtable and this did help us get ‘off the ground’ initially. GAE’s implementation is a ‘bastardised’ version of the DataNucleus one.
However, the many-to-many mapping we had used in the traditional RDBMS was very expensive to run, primarily due to it being broken down to three queries (no joins, remember). Now, in hindsight, we could’ve made some changes to the table structures to improve this, but instead we looked at the alternatives to JPA.
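To make the ‘three queries’ point concrete, here is a rough sketch in plain Java with no datastore involved. The class, the sample data and the exact split into three lookups are all made up for illustration; it is one plausible way a joinless many-to-many read breaks down, not a record of what JPA actually issued for us.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-ins for a user table and a user/event mapping table.
class JoinlessLookup {
    static final Map<Long, String> USERS = Map.of(1L, "alice", 2L, "bob", 3L, "carol");
    // Each row is {userId, eventId}, as a mapping table would hold it.
    static final List<long[]> USER_EVENTS = List.of(
            new long[] {1, 10}, new long[] {2, 10}, new long[] {2, 20}, new long[] {3, 20});

    // Counts simulated round trips, since there is no join to fold them into one.
    static int queriesIssued = 0;

    static Set<String> usersSharingEventsWith(long userId) {
        // Lookup 1: mapping rows for this user give the event ids.
        queriesIssued++;
        Set<Long> eventIds = new HashSet<>();
        for (long[] row : USER_EVENTS) {
            if (row[0] == userId) eventIds.add(row[1]);
        }

        // Lookup 2: mapping rows for those events give the other user ids.
        queriesIssued++;
        Set<Long> userIds = new HashSet<>();
        for (long[] row : USER_EVENTS) {
            if (eventIds.contains(row[1]) && row[0] != userId) userIds.add(row[0]);
        }

        // Lookup 3: fetch the user records themselves.
        queriesIssued++;
        Set<String> names = new TreeSet<>();
        for (Long id : userIds) names.add(USERS.get(id));
        return names;
    }

    public static void main(String[] args) {
        System.out.println(usersSharingEventsWith(2L) + " via " + queriesIssued + " lookups");
    }
}
```

In an RDBMS the same answer would come back from a single joined query, which is why the mapping-table design felt so expensive here.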
There’s a whole bunch of alternatives that can be used to manage the data; Google even supply a JDO version, again using DataNucleus code. (As an aside, my first foray into Java object-based data management was with JDO at Honda.) However, we’re after something written specifically for GAE data that will, hopefully, perform better and still keep all the usability of JPA, which I quite like.
The two main contenders are Objectify and Twig. It was a real toss-up to decide which one to go for, as neither offered any real advantages over the other that we could see. They have different philosophies, but we are not too concerned over that. What was important is that we didn’t want to have to try each one to see which was best (yeah, stupid sounding, but we’re talking frameworks on top of the same database, not some investment in the actual DB itself.)
After reading up on the frameworks and enjoying the lively discussions between the developers, we went with Objectify. Why? Only because this seemed to have more activity and development. There’s a very useful blog here that offers some comparison of the choices.
What would have really swung it is if one of the sites showed *how* to best implement a traditional RDBMS with many-to-many mappings in their respective frameworks. A clear example of transitioning from one to the other would have been ideal. I appreciate that this is tricky, as there are many ways to implement this, but it would have been nice. By extension, examples of how to query against these new structures would have been good too.
Having converted over to Objectify, how did it go?
From a code point of view it is a lot simpler than JPA. None of this detached merging and persisting. Just store the data and it knows whether to insert or update. Nice.
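The ‘just store it’ behaviour is easy to picture as a keyed write. The sketch below is plain Java, not Objectify’s API; `KeyedStore` and its methods are invented for illustration, and it assumes that saving under an existing id simply overwrites what was there.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical store: one save method covers both insert and update.
class KeyedStore {
    private final Map<Long, String> entities = new HashMap<>();
    private long nextId = 1;

    // A null id means "new entity": allocate an id. Otherwise overwrite in place.
    long save(Long id, String value) {
        long key = (id == null) ? nextId++ : id;
        // Keep allocated ids ahead of any id supplied by the caller.
        nextId = Math.max(nextId, key + 1);
        entities.put(key, value);
        return key;
    }

    String load(long id) {
        return entities.get(id);
    }

    int size() {
        return entities.size();
    }

    public static void main(String[] args) {
        KeyedStore store = new KeyedStore();
        long id = store.save(null, "draft");   // behaves as an insert
        store.save(id, "final");               // behaves as an update
        System.out.println(store.load(id) + ", entities stored: " + store.size());
    }
}
```

There is no separate merge or persist step for the caller to get wrong, which is most of what made the converted code shorter.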
Queries took a little working out, as they are filter based, almost like chaining results through filters to narrow the search down (perhaps that *is* what it does).
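For anyone who hasn’t met this style, here is a toy version of filter chaining in plain Java. `MiniQuery` is a made-up class, not part of Objectify, and it filters an in-memory list, whereas the datastore answers filters from its indexes. It only shows the shape of the calling code: each `filter` call narrows the result and nothing runs until you ask for the list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical chained query over an in-memory list.
class MiniQuery<T> {
    private final List<T> source;
    private final List<Predicate<T>> filters = new ArrayList<>();

    MiniQuery(List<T> source) {
        this.source = source;
    }

    // Record the condition and return this, so calls can be chained.
    MiniQuery<T> filter(Predicate<T> condition) {
        filters.add(condition);
        return this;
    }

    // Run the query: keep only items that pass every recorded filter.
    List<T> list() {
        List<T> out = new ArrayList<>();
        for (T item : source) {
            boolean keep = true;
            for (Predicate<T> f : filters) {
                if (!f.test(item)) {
                    keep = false;
                    break;
                }
            }
            if (keep) out.add(item);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ages = List.of(12, 25, 31, 40, 67);
        System.out.println(new MiniQuery<Integer>(ages)
                .filter(a -> a >= 18)
                .filter(a -> a < 65)
                .list());
    }
}
```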
Our table structures changed. Rather than have a many-to-many mapping table to hold events and users, we changed to having a list of events stored with each user. There are some restrictions on this, such as no single list being allowed more than 5000 entries, and the index explosion can be great, but it seems OK for our limited data. On reflection, we could have done this with the JPA code, although querying this structure would have been difficult in JPA.
Performance-wise we’ve reduced our cost of querying for users in the same events by a factor of 10. Not insignificant when this happens every minute for every user and you pay for server costs (eventually).
The beauty of this change is that it will be seamless from a user perspective. We can just roll out a new GAE version and make this the default. All our Restlet mappings point to the new Server Resources and we’re good to go.
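As a sketch of the new shape, again in plain Java rather than Objectify: each user carries its own list of event ids, and a lookup by event goes through an index with one entry per list value. The class names, the sample data and the way the index is simulated are assumptions for illustration; the 5000 figure is just the limit quoted above.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class DenormalisedUsers {
    static final int MAX_LIST_SIZE = 5000; // limit as quoted in this post

    // Hypothetical entity: the event ids live on the user, no mapping table.
    static class User {
        final String name;
        final List<Long> eventIds;

        User(String name, Long... eventIds) {
            this.name = name;
            this.eventIds = List.of(eventIds);
        }
    }

    // Simulated index on the list property: one entry per (eventId, user) pair,
    // so long lists mean many index entries per user.
    static Map<Long, Set<String>> buildIndex(List<User> users) {
        Map<Long, Set<String>> index = new TreeMap<>();
        for (User u : users) {
            if (u.eventIds.size() > MAX_LIST_SIZE) {
                throw new IllegalArgumentException("too many events for " + u.name);
            }
            for (Long eventId : u.eventIds) {
                index.computeIfAbsent(eventId, k -> new TreeSet<>()).add(u.name);
            }
        }
        return index;
    }

    // One lookup answers "who is in this event?".
    static Set<String> usersInEvent(Map<Long, Set<String>> index, long eventId) {
        return index.getOrDefault(eventId, Collections.emptySet());
    }

    public static void main(String[] args) {
        List<User> users = List.of(
                new User("alice", 10L), new User("bob", 10L, 20L), new User("carol", 20L));
        System.out.println(usersInEvent(buildIndex(users), 10L));
    }
}
```

The trade-off is that the write side now maintains an index entry for every event on every user, which is where the list-size limit and the index growth come from.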
Long Term Usage
We’re still concerned about the lack of a true IN feature. These are emulated by running a query for each term in the IN statement. This is very expensive to run and hard to predict how much it will cost as the IN terms are variable.
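The cost problem is easy to see in a toy version. This is plain Java with invented names, assuming (as described above) that an IN over N terms is run as N separate lookups whose results are merged.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical emulation of IN: one lookup per term, results merged.
class InFanOut {
    private final Map<Long, Set<String>> usersByEvent;
    int queriesRun = 0; // simulated lookups issued so far

    InFanOut(Map<Long, Set<String>> usersByEvent) {
        this.usersByEvent = usersByEvent;
    }

    Set<String> usersInAnyEvent(Collection<Long> eventIds) {
        Set<String> merged = new TreeSet<>();
        for (Long eventId : eventIds) {
            queriesRun++; // every term costs a lookup, even if it matches nothing
            merged.addAll(usersByEvent.getOrDefault(eventId, Collections.emptySet()));
        }
        return merged;
    }

    public static void main(String[] args) {
        InFanOut query = new InFanOut(Map.of(
                10L, Set.of("alice", "bob"),
                20L, Set.of("bob", "carol")));
        Set<String> users = query.usersInAnyEvent(List.of(10L, 20L, 30L));
        System.out.println(users + " cost " + query.queriesRun + " lookups");
    }
}
```

The number of lookups tracks the number of terms rather than the number of results, so a user in many events costs more on every poll, and that is the part we can’t predict in advance.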
If Mapcial takes off, then I think we’ll need to look at a different hosting solution that uses a more traditional RDBMS, as the Bigtable implementation and its unpredictable costs will bankrupt us.
Anyone know a good, free (or cheap) alternative?
The Mapcial and Walk2012 versions are functionally identical, with Walk2012 having some additional branding and icon changes. You can now pick up either version from a link on the original announcement post.
Feel free to leave comments. Did I mention it’s virtually beta, hasn’t crashed for a while and totally awesome? Cheesynick is using it as part of his London 2 Brighton run 😀