At kaChing, we've tried to embrace as much of the lean startup methodology as possible. In keeping with that spirit, we've worked to scale our infrastructure smartly, using data to drive our decisions and discarding speculation. As part of our infrastructure, we've embraced Project Voldemort as a highly performant and reliable data store. One experiment we've been looking into is how Solid State Drives may improve the performance of Voldemort, and, perhaps even more importantly, how that performance compares to the cost of the hardware. Before even starting, every indication pointed to SSDs providing a significant performance boost in almost every type of benchmark, but we are solely concerned with how SSDs perform in our production infrastructure. I realize that's a huge caveat, and I suppose there are plenty of artifacts that affect the performance numbers (purists, shudder), but there are plenty of good reasons to use our production systems. First, I don't have to take any machines out of rotation (unused resources are costly). Second, I don't need to do any work to set up the benchmark; I just need to instrument already-running services. Third, and probably most importantly, I get performance numbers on actual data(!), not just a contrived benchmark. If I plug more SSDs into my infrastructure, I know exactly how they will perform on the data that is most unique and special to me.
Setting up the benchmark was fairly easy. Since the source for Voldemort is available on GitHub, I just cloned the latest version and added some stopwatches using Perf4J. I was primarily concerned with the round-trip times for the standard operations 'get', 'put', and 'getAll' as seen from our clients (as an aside, Voldemort makes server-side stats available via JMX if that's your interest). We then use a Log4J appender to forward the stats collected by Perf4J to a central hub. I wrote a nice little parser in Scala (woot!) and generated some charts using JFreeChart.
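To give a sense of what that instrumentation looks like, here's a minimal sketch of wrapping Voldemort client calls with Perf4J stopwatches. The actual changes live in my branch of the Voldemort code; the bootstrap URL, store name, and tags below are made-up placeholders, not our production configuration.

```java
import org.perf4j.StopWatch;
import org.perf4j.log4j.Log4JStopWatch;
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class TimedVoldemortClient {
    public static void main(String[] args) {
        // Placeholder bootstrap URL and store name, not our actual setup.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, byte[]> client = factory.getStoreClient("ticker-store");

        // Log4JStopWatch emits a Perf4J timing line (tag, start time, elapsed) on stop(),
        // which a Log4J appender can then forward to a central hub for charting.
        StopWatch getWatch = new Log4JStopWatch("voldemort.get");
        Versioned<byte[]> value = client.get("AAPL");
        getWatch.stop();

        StopWatch putWatch = new Log4JStopWatch("voldemort.put");
        client.put("AAPL", value == null ? new byte[0] : value.getValue());
        putWatch.stop();
    }
}
```

Timing at the client rather than on the server captures the full round trip, including serialization and network time, which is what our services actually see.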
It also makes sense to talk about the hardware used in the experiment. In general, the machines run similar Linux OSes and mostly similar services. They all have 32-bit, dual-core processors and 4GB of memory. The biggest difference is that half the machines are configured with the Physical Address Extension (PAE) feature while the others are not. Machines with PAE enabled are capable of addressing the full 4GB of memory, while non-PAE machines reserve ~1GB for the kernel.
Machine | #14 (SSD) | #1 | #6 | #8 | #29 | #30 |
---|---|---|---|---|---|---|
Memory | 3GB | 4GB (PAE) | 3GB | 3GB | 4GB (PAE) | 4GB (PAE) |
CPU | Intel E2140 1.60GHz | Intel E2180 2.00GHz | Intel E2140 1.60GHz | Intel E2140 1.60GHz | Intel E2160 1.80GHz | Intel E2160 1.80GHz |
So, without further delay, here are the charts and data! In the charts, the vertical axis is the average time in milliseconds between when an operation starts and when it stops, as measured at the client. The timespan covers 5 full stock market trading days. The chart data was captured in 10-minute increments, while the tables roll the data up into averages for the entire day. In addition, the daily tables show the number of times the operation was called as the second number in each table cell. The SSD machine is drawn as a red line so it's clearly distinguishable from the others.
This first set of charts shows stock ticker data that we store in Voldemort. We fetch stock information from our provider and put the data into Voldemort as protocol buffers, essentially using it as a persistent cache. As you can see in the chart, the process starts ramping up around 6am EST and stays consistent throughout the day until right before the stock market closes at 4pm. Each object is roughly 190 bytes.
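Conceptually, the write path for this store looks something like the sketch below: each quote from the provider is serialized as a protobuf and written under its ticker symbol. `TickerQuote` is a hypothetical protobuf-generated class standing in for our real schema, and the key scheme is purely illustrative.

```java
import voldemort.client.StoreClient;

public class TickerCache {
    private final StoreClient<String, byte[]> store;

    public TickerCache(StoreClient<String, byte[]> store) {
        this.store = store;
    }

    // TickerQuote is a hypothetical protobuf-generated message; the real
    // objects serialize to roughly 190 bytes each.
    public void cacheQuote(TickerQuote quote) {
        // Voldemort acts as a persistent cache, keyed by ticker symbol.
        store.put(quote.getSymbol(), quote.toByteArray());
    }

    public TickerQuote lookup(String symbol) throws Exception {
        byte[] bytes = store.getValue(symbol);
        return bytes == null ? null : TickerQuote.parseFrom(bytes);
    }
}
```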
Stock Ticker / Get
Machine | 01/10 | 01/11 | 01/12 | 01/13 | 01/14 |
---|---|---|---|---|---|
#14 (SSD) | 2.10 ms / 12579905 | 1.51 ms / 12808948 | 2.00 ms / 12372251 | 2.10 ms / 12775541 | 2.10 ms / 13021540 |
#1 | 4.50 ms / 13639223 | 2.86 ms / 14066013 | 4.50 ms / 14035117 | 4.30 ms / 14014719 | 4.30 ms / 14381169 |
#6 | 3.30 ms / 14697628 | 2.47 ms / 14974348 | 3.50 ms / 14976403 | 3.20 ms / 14992560 | 3.30 ms / 15297108 |
#8 | 3.30 ms / 14268500 | 2.41 ms / 14541938 | 3.40 ms / 14543452 | 3.10 ms / 14537675 | 3.20 ms / 14933763 |
#29 | 3.80 ms / 15638044 | 2.58 ms / 15978472 | 4.00 ms / 16044711 | 3.20 ms / 15962932 | 3.70 ms / 16407025 |
#30 | 3.10 ms / 13613129 | 2.30 ms / 13844155 | 3.30 ms / 13930480 | 3.10 ms / 13768293 | 3.30 ms / 14126112 |
Stock Ticker / Put
Machine | 01/10 | 01/11 | 01/12 | 01/13 | 01/14 |
---|---|---|---|---|---|
#14 (SSD) | 2.20 ms / 12191506 | 1.57 ms / 12458999 | 2.10 ms / 12328035 | 2.20 ms / 12433188 | 2.20 ms / 12712775 |
#1 | 4.70 ms / 13203697 | 2.92 ms / 13646533 | 4.60 ms / 13620714 | 4.50 ms / 13596018 | 4.40 ms / 13952884 |
#6 | 3.50 ms / 14313209 | 2.53 ms / 14611432 | 3.70 ms / 14576707 | 3.30 ms / 14590833 | 3.50 ms / 14941805 |
#8 | 3.40 ms / 13969025 | 2.47 ms / 14268355 | 3.50 ms / 14189613 | 3.20 ms / 14229597 | 3.40 ms / 14575123 |
#29 | 4.20 ms / 15652314 | 2.75 ms / 15966916 | 4.30 ms / 15929493 | 3.40 ms / 15952355 | 4.00 ms / 16369449 |
#30 | 3.60 ms / 14872750 | 2.58 ms / 15160371 | 3.80 ms / 15213295 | 3.60 ms / 15075920 | 3.90 ms / 15508978 |
Stock Ticker / GetAll
Machine | 01/10 | 01/11 | 01/12 | 01/13 | 01/14 |
---|---|---|---|---|---|
#14 (SSD) | 1.40 ms / 160011 | 2.79 ms / 165053 | 2.10 ms / 179204 | 1.20 ms / 230847 | 2.50 ms / 284638 |
#1 | 9.20 ms / 168622 | 4.37 ms / 175207 | 5.80 ms / 190591 | 5.20 ms / 252607 | 5.80 ms / 310047 |
#6 | 2.70 ms / 165798 | 4.37 ms / 171471 | 4.80 ms / 185357 | 2.50 ms / 237498 | 5.10 ms / 293108 |
#8 | 2.50 ms / 159300 | 4.50 ms / 164113 | 4.90 ms / 177945 | 2.20 ms / 229931 | 3.90 ms / 286053 |
#29 | 3.30 ms / 169287 | 3.51 ms / 174495 | 3.70 ms / 188762 | 3.30 ms / 251796 | 4.90 ms / 309301 |
#30 | 10.50 ms / 164491 | 9.78 ms / 170083 | 11.70 ms / 183780 | 10.90 ms / 240214 | 11.70 ms / 294018 |
The next data set is generated by a batch processing job that calculates portfolio performance from the day's market data. It's scheduled to start after the market closes and runs for a few hours to completion. The data is represented as protobufs, roughly 500 bytes per object. The daily values in the table are a little skewed (read: messed up) because the snapshot is taken at 16:00 EST, which lands right in the middle of the calculation and seems to miss some of the data. The charts are unaffected, since they're sampled every 10 minutes.
Data Crunching / Get
Machine | 01/10 | 01/11 | 01/12 | 01/13 | 01/14 |
---|---|---|---|---|---|
#14 (SSD) | 6.40 ms / 14502 | 13.13 ms / 249326 | 8.70 ms / 258187 | 11.90 ms / 13453 | 12.30 ms / 302971 |
#1 | 10.90 ms / 17765 | 15.31 ms / 270925 | 13.30 ms / 279968 | 20.30 ms / 14883 | 20.80 ms / 328361 |
#6 | 11.60 ms / 16839 | 13.60 ms / 288287 | 11.30 ms / 298032 | 15.90 ms / 15487 | 18.00 ms / 349492 |
#8 | 11.30 ms / 17217 | 13.68 ms / 281386 | 11.20 ms / 290325 | 19.40 ms / 15535 | 20.00 ms / 340501 |
#29 | 5.60 ms / 19824 | 12.15 ms / 304663 | 9.50 ms / 314920 | 17.90 ms / 16477 | 16.80 ms / 368730 |
#30 | 10.10 ms / 18200 | 14.88 ms / 268220 | 12.70 ms / 276024 | 18.80 ms / 15299 | 22.60 ms / 321462 |
Data Crunching / Put
Machine | 01/10 | 01/11 | 01/12 | 01/13 | 01/14 |
---|---|---|---|---|---|
#14 (SSD) | 10.00 ms / 5530 | 12.79 ms / 177981 | 8.30 ms / 183660 | 15.00 ms / 7435 | 12.20 ms / 216806 |
#1 | 14.30 ms / 6575 | 14.91 ms / 196125 | 12.70 ms / 202345 | 25.00 ms / 8225 | 20.80 ms / 238778 |
#6 | 12.90 ms / 6746 | 12.69 ms / 201875 | 10.10 ms / 208506 | 19.00 ms / 8536 | 17.10 ms / 245850 |
#8 | 15.20 ms / 7035 | 12.68 ms / 207179 | 10.10 ms / 213601 | 26.50 ms / 8778 | 18.40 ms / 251763 |
#29 | 9.30 ms / 7321 | 11.34 ms / 216955 | 8.90 ms / 224183 | 21.40 ms / 9135 | 16.40 ms / 264182 |
#30 | 11.70 ms / 7260 | 13.77 ms / 215958 | 11.50 ms / 222130 | 27.40 ms / 9115 | 21.30 ms / 260285 |
Data Crunching / GetAll
Machine | 01/10 | 01/11 | 01/12 | 01/13 | 01/14 |
---|---|---|---|---|---|
#14 (SSD) | 0.60 ms / 3682 | 1.00 ms / 4153 | 0.70 ms / 4989 | 0.50 ms / 2739 | 0.70 ms / 6381 |
#1 | 10.40 ms / 3872 | 6.62 ms / 4326 | 8.70 ms / 5077 | 8.80 ms / 2776 | 7.00 ms / 5278 |
#6 | 8.10 ms / 4031 | 8.64 ms / 4617 | 10.70 ms / 5642 | 9.20 ms / 3052 | 8.60 ms / 6120 |
#8 | 7.20 ms / 4470 | 6.97 ms / 5239 | 9.70 ms / 5784 | 8.60 ms / 3424 | 7.40 ms / 6473 |
#29 | 3.30 ms / 4423 | 5.13 ms / 5077 | 6.20 ms / 5687 | 6.60 ms / 3428 | 5.50 ms / 5923 |
#30 | 14.80 ms / 4103 | 12.08 ms / 5114 | 20.10 ms / 5474 | 21.20 ms / 3219 | 17.40 ms / 6352 |
This last set of charts is also a portfolio analytics calculation that runs as a batch processing job and uses Voldemort as a persistent intermediary. It's also stored as protobufs, at roughly 8-9KB per object. Of the three stores, this one is the least interesting, as it's only used for a short period during the day and is untouched the rest of the time. I also believe the sample size is somewhat small, but of all the experiments, the SSD performs the worst on this data set. Due to the timing of the daily cutoff and the low number of calls, I'm not going to bother including the table.
Gains / Get
Gains / Put
Gains / GetAll
Some Conclusions
Clearly, the performance of the solid state drive is better than that of our other stores backed by traditional drives. One other interesting thing we notice in the data is that our machines without the Physical Address Extension seem to outperform the machines with the extension. One possible explanation is that PAE adds an additional level of indirection to memory operations, but I'd be interested in whatever thoughts others have on the topic. It's a little tough to see in the charts, since I was trying to highlight the SSD machine, but hopefully in a post down the line I'll emphasize that difference. Also, since we collected the data for this experiment, we've added a 64-bit machine into the rotation, so it should be interesting to see what kind of results we get from that machine.
Finally, special thanks to Andrew Schwabecher and Will over at Central Host for helping us out! Also, for anyone who's interested in seeing what changes I made to Voldemort to perform the benchmarking, check out my branch of Voldemort. (Try to ignore the hack I have in there for passing the configuration into Voldemort instead of reading it from disk; it's unrelated to the profiling.)