Friday, March 18, 2011

Ode to Textual Identifiers

Entities of a domain model are distinct from all other objects in their longevity. These objects are typically stored in a database, or a similar form of long-term storage, and need portable identifiers which can be shared within a system, and beyond. Identifiers are essentially "handles to entities" which can be passed around, in the same way that file descriptors are "file handles".


Identifiers must, by nature, be exchangeable across systems in a fully language-neutral way. They are in your DB, in your web stack, on your webpages being manipulated by JavaScript, in e-mails you send to customers for tracking, and more.


Despite their ubiquity, software architects often pay too little attention to the implications of identifiers' structure and their interaction with the rest of the software. They'll take the out-of-the-box 32- or 64-bit sequence numbers from the database and use an int or long to pass them around. However, there are three key design requirements which need to be met for a successful scheme.

Internal vs. External

The first important use of identifiers is what we call internal: references within the application. In other words, everything that never leaves your system. Things that do not qualify as "internal" include identifiers used in URLs, sent by e-mail or shared with partners for callbacks.


In contrast, external identifiers are those you share and show.


So what's the big deal and why should we care about the distinction? Let's assume the URL for a message in your Gmail inbox was something like

https://mail.google.com/message/5231678
There would be two potential problems:

  • Inferring growth: since most numeric identifiers are generated from an increasing sequence, showing these numbers gives a hint to the outside world about the internal growth of the population of various entities. In the case of Gmail above, we could ballpark the number of messages they process in a typical day by sending messages every 10 minutes to an account and then analyzing the identifier assigned to those periodic messages.
  • Anonymity of entities: in some cases, being unable to randomly find an entity is a desirable property. A low-key example is yfrog's URLs, such as http://yfrog.com/h4ftjnhj, used to share pictures. Since yfrog is mostly used in combination with Twitter, the pictures are meant to be broadcast and so anonymity is probably not an issue… But in the case of Skitch, whose sharing model is similar to yfrog's with the exception of being powered by a desktop app, anonymity is much more important. It's what they call a "secret link".
A good solution should be neither reverse-engineerable nor forgeable.
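To make the growth inference concrete, here is a rough sketch of the arithmetic an outside observer could do. The observations are invented for illustration, and the function name estimate_daily_volume is hypothetical.

```python
from datetime import datetime


def estimate_daily_volume(observations):
    """Estimate entities created per day from (timestamp, sequential_id) pairs."""
    observations = sorted(observations)
    (first_time, first_id), (last_time, last_id) = observations[0], observations[-1]
    elapsed_days = (last_time - first_time).total_seconds() / 86400
    return (last_id - first_id) / elapsed_days


# Invented data: identifiers seen on probe messages sent 10 minutes apart.
probes = [
    (datetime(2011, 3, 18, 9, 0), 5231678),
    (datetime(2011, 3, 18, 9, 10), 5231978),
    (datetime(2011, 3, 18, 9, 20), 5232278),
]

rate = estimate_daily_volume(probes)
print(round(rate))  # 600 ids in 20 minutes extrapolates to 43200 per day
```

Three probes are enough for a ballpark figure; more probes spread over a day would also reveal daily traffic patterns.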

Typed Identifiers

Our final design concern is going to revolve around identifier confusion. You're looking at some database and see an id "203": is this a user id? an account id? or maybe, a comment id? You just don't know without the proper context. Recognizing identifiers is only the harmless half of the coin. On the flip side, using the wrong identifier when doing operations could be disastrous. Think of doing a money deposit in the wrong bank account because the user id is used instead of the account id!


We've already discussed how to remove this ambiguity in Java using parameterized, typed identifier wrappers which look like Id<User>, Id<Account> or Id<Comment>. Unfortunately, it's highly unlikely that you're only dealing with statically typed languages: you'll always have issues revolving around your favorite scripting language like Python, Ruby, Shell, Clojure or simply a tired engineer getting woken up at 6am to patch a db by hand.


Prefixing identifiers with entity-specific codes is a simple patch. You'll end up with USE-203, ACT-203 or COM-203 for user, account or comment respectively. Regardless of the system processing those identifiers, if a user id is sent in lieu of an account id, the mismatched prefix gives it away and no harm will ensue.
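As a minimal sketch of how a dynamically typed system might enforce these prefixes, consider the following. The PREFIXES table and the format_id and parse_id helpers are hypothetical names chosen for illustration, not part of any existing library.

```python
# Hypothetical mapping from entity type to its three-letter code.
PREFIXES = {"user": "USE", "account": "ACT", "comment": "COM"}


def format_id(entity, number):
    """Render a numeric id with the prefix of its entity type."""
    return f"{PREFIXES[entity]}-{number}"


def parse_id(entity, text):
    """Return the numeric part, refusing ids that carry another entity's prefix."""
    prefix, separator, digits = text.partition("-")
    well_formed = separator == "-" and digits.isascii() and digits.isdigit()
    if not well_formed or prefix != PREFIXES[entity]:
        raise ValueError(f"expected a {entity} id, got {text!r}")
    return int(digits)


print(format_id("user", 203))
print(parse_id("account", "ACT-203"))

try:
    parse_id("account", "USE-203")  # a user id where an account id is required
except ValueError as error:
    print("rejected:", error)
```

The check happens at the boundary where the textual identifier is turned back into a number, so a deposit routine that calls parse_id("account", ...) never sees a user's number.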

Putting It All Together

While it is a good idea to have identifiers bigger than the 2^32 range, i.e. ints, to avoid hitting a wall at about 4 billion, it is unusual to require 2^64 identifiers. (As a reference point, there are approximately 2^55 atoms in the world and roughly 2^84 atoms in the universe.)


What we've settled on are identifiers between 0 and 2^48 - 1. As we can see in the diagram, this range is only one of more than 65,000 available on the 2^64 domain.
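The sizes involved are easy to check. This short sketch only restates the arithmetic behind the 32-bit, 48-bit and 64-bit ranges; the constant names are illustrative.

```python
INT_RANGE = 2**32   # what a 32-bit int can enumerate
ID_RANGE = 2**48    # the identifier range settled on
DOMAIN = 2**64      # everything a 64-bit value can represent

print(f"{INT_RANGE:,}")            # 4,294,967,296
print(f"{ID_RANGE:,}")             # 281,474,976,710,656
print(f"{DOMAIN // ID_RANGE:,}")   # 65,536 disjoint ranges of that size
```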


We still need to avoid reverse engineering of internal sequencing as well as forgeability. Symmetric-key encryption algorithms, and in particular block ciphers, neatly solve both problems. By construction, these algorithms provide a one-to-one and onto function which is invertible only if the secret key is known.


After selecting a key, we map the identifiers to the whole 2^64 range. For instance, one might pad an identifier with 16 zero bits and then use a block cipher such as 3DES, whose block size is 64 bits. (Experts will notice that an attacker trying about 2^16 random identifiers would expect to get a forgery, since a random 64-bit value decrypts to one with 16 leading zero bits with probability 2^-16!)
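The mapping can be sketched as follows. Python's standard library does not ship 3DES, so this example swaps in a toy 64-bit Feistel network whose round function is HMAC-SHA-256. That substitution is an assumption made to keep the sketch self-contained, not the scheme described above and not a vetted cipher. The names encrypt_id and decrypt_id and the key are likewise hypothetical.

```python
import hashlib
import hmac

MASK32 = 0xFFFFFFFF
ROUNDS = 8  # arbitrary choice for this toy construction


def _round(key, index, half):
    """Keyed round function: 32 bits in, 32 bits out."""
    message = bytes([index]) + half.to_bytes(4, "big")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def encrypt_id(key, internal_id):
    """Map a 48-bit internal id to a 64-bit external value."""
    if not 0 <= internal_id < 2**48:
        raise ValueError("internal id must fit in 48 bits")
    # The top 16 bits of the 64-bit block are the zero padding.
    left, right = internal_id >> 32, internal_id & MASK32
    for index in range(ROUNDS):
        left, right = right, left ^ _round(key, index, right)
    return (left << 32) | right


def decrypt_id(key, external_value):
    """Invert encrypt_id and reject values whose padding is not all zeros."""
    if not 0 <= external_value < 2**64:
        raise ValueError("external value must fit in 64 bits")
    left, right = external_value >> 32, external_value & MASK32
    for index in reversed(range(ROUNDS)):
        left, right = right ^ _round(key, index, left), left
    block = (left << 32) | right
    if block >> 48:
        raise ValueError("not a valid identifier")
    return block


key = b"placeholder key for illustration"
external = encrypt_id(key, 203)
print(external)
print(decrypt_id(key, external))
```

Because the function is a bijection on 64-bit values, consecutive internal ids map to distinct, unrelated-looking external values, and only about one in 2^16 arbitrary 64-bit values passes the padding check in decrypt_id.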


Balancing size constraints with convenience, you can encode this in base 36 as 13 characters (case-insensitive) or base 62 which gets you down to 11 characters.
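The character counts follow from how many digits a 64-bit value needs in each base. Below is a small sketch with fixed-width encoding; the alphabets and the encode and decode helpers are assumptions for illustration.

```python
import string

BASE36 = string.digits + string.ascii_lowercase
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def encode(value, alphabet, width):
    """Encode a 64-bit value as exactly `width` characters, zero-padded."""
    if not 0 <= value < 2**64:
        raise ValueError("value must fit in 64 bits")
    base = len(alphabet)
    if base**width < 2**64:
        raise ValueError("width too small for this alphabet")
    characters = []
    for _ in range(width):
        value, digit = divmod(value, base)
        characters.append(alphabet[digit])
    return "".join(reversed(characters))


def decode(text, alphabet):
    """Inverse of encode; for base 36, lowercase the input first."""
    base = len(alphabet)
    value = 0
    for character in text:
        value = value * base + alphabet.index(character)
    return value


largest = 2**64 - 1
print(encode(largest, BASE36, 13))
print(encode(largest, BASE62, 11))

# 12 base-36 digits or 10 base-62 digits are not enough for 64 bits.
print(36**12 < 2**64 <= 36**13, 62**10 < 2**64 <= 62**11)  # True True
```

Fixed width keeps every identifier the same length, which makes malformed values easy to spot before any decryption is attempted.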

The last step is to add a type prefix depending on the entity being referenced. Identifiers will now look like the following.


The first 3 characters (green) encode the entity type and the 13 others (blue) are the externalized identifier.


Correctness by design is one of our big mottos at Wealthfront and it sometimes starts with things as small as identifiers. While putting these principles to use in an existing code base can be challenging, we hope you can take advantage of this technique to create safer applications.