I often find myself making serialization design choices to persist financial data in a database. The financial data I’m referring to are domain objects describing stocks, stock quotes, stock fundamentals, corporate actions, and so on. For example, a typical stock object has 20 to 30 fields describing its static information: country, exchange, currency, security type, lot size, etc.; an end-of-day stock quote object has about 10 fields representing its daily open, close, high, low, ask, bid, and volume.
At the beginning I stored each field in its own column. But I soon felt the pain: managing all those columns is a lot of work. For example, a stock fundamental object easily involves more than 100 metrics measuring a company’s financial results. On top of that, new metrics are constantly needed to power an analytical model, and every one of them requires creating yet another column, which is pretty rigid.
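Concretely, under the column-per-field design every new metric turns into a schema migration, something like this (the table and column names here are made up for illustration):

```sql
-- One ALTER TABLE per new metric; multiply by dozens of metrics per year.
ALTER TABLE stock_fundamentals ADD COLUMN trailing_revenue_growth DOUBLE;
ALTER TABLE stock_fundamentals ADD COLUMN free_cash_flow_margin   DOUBLE;
```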
Later on I started to use JSON to serialize objects in the database. Only one text column is needed to store all the fields of a JSON object. If I need to add more fields, I just add them in the code; no database migration is needed. It’s quite handy.
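As a minimal sketch of the idea (in practice you’d use a library like Jackson or Gson; the hand-rolled serializer below skips string escaping to stay dependency-free, and all field names are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class StockJson {
    // Serialize a flat field map to a JSON string destined for one TEXT column.
    // Note: no escaping of quotes/backslashes -- a real implementation would
    // delegate to Jackson or Gson.
    public static String toJson(Map<String, ?> fields) {
        return fields.entrySet().stream()
            .map(e -> "\"" + e.getKey() + "\":" +
                 (e.getValue() instanceof String
                     ? "\"" + e.getValue() + "\""
                     : e.getValue()))
            .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, Object> stock = new LinkedHashMap<>();
        stock.put("ticker", "IBM");
        stock.put("country", "US");
        stock.put("exchange", "NYSE");
        stock.put("lotSize", 100);
        // The whole string goes into a single TEXT column; adding a field
        // later is one more map entry, no ALTER TABLE.
        System.out.println(toJson(stock));
        // → {"ticker":"IBM","country":"US","exchange":"NYSE","lotSize":100}
    }
}
```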
Even later on, I switched to protocol buffers, serializing objects into a single blob column in the database. They have the same benefit of adding new fields without adding new columns. The added benefit is that less code is needed to represent domain objects: I just specify the names and types of the fields, and the compiler generates a class with almost everything I need: getters, setters, a builder, and even a CSV-style toString method, which I use to dump data into spreadsheets.
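A schema along these lines is all that has to be written by hand (message and field names are illustrative, not from my actual codebase):

```proto
syntax = "proto3";

message StockQuote {
  string ticker = 1;
  string date   = 2;
  double open   = 3;
  double close  = 4;
  double high   = 5;
  double low    = 6;
  int64  volume = 7;
  // Adding a new metric later is one line here with a fresh tag number;
  // old serialized rows still parse.
}
```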
But…still I run into difficulties. I often need to give prompt responses to teammates or clients regarding stock data. For example: in which country is this company located? On which exchange is this stock traded? At what trailing P/E ratio was this stock trading as of some date? I can’t quickly answer those questions by looking at the database, because the data are all binary protocol buffers! Also, I frequently need to do ad hoc analysis on specific types of stocks. For example, all the stocks whose security type is ADR and that trade on NYSE, or all the stocks whose trailing revenue growth rate is at least 15%. I would have been able to retrieve those stocks with a simple SQL query if the fields were stored in columns. But now I have to write Java code to parse the protocol buffers or JSON just to do this simple task!
Balancing all the pros and cons, I’m currently leaning toward JSON plus denormalizing the important fields into columns as my way of serialization, because:
- Denormalizing important fields in columns enables quick SQL queries
- Using JSON gives the flexibility of adding new fields without adding new columns
- Yeah, using JSON I’ll have to write more code than with protocol buffers…but since writing that code is a one-time sunk cost, the perpetual benefit of data being readable to human eyes adds a lot of value for quick operational responses. Not to mention I really hate losing the ability to just look at the data and think.
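The hybrid layout I have in mind looks roughly like this (table and column names are hypothetical; which fields get denormalized depends on which queries come up most often):

```sql
-- Important, frequently-queried fields get real columns;
-- everything else lives in one JSON payload.
CREATE TABLE stock (
  ticker        TEXT PRIMARY KEY,
  exchange      TEXT,   -- denormalized: used for filtering
  security_type TEXT,   -- denormalized: used for filtering
  payload       TEXT    -- full object serialized as JSON
);

-- The ad hoc questions from earlier become one-liners again:
SELECT ticker FROM stock
WHERE security_type = 'ADR' AND exchange = 'NYSE';
```

Anything not promoted to a column stays queryable by parsing `payload` in code, so denormalization is an optimization, not a commitment.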