Voldemort in the Wild

At kaChing, we’ve tried to embrace as much of the lean startup methodology as possible. In keeping with that spirit, we’ve worked to scale our infrastructure smartly, using data rather than speculation to drive our decisions. As part of that infrastructure, we’ve adopted Project Voldemort as a highly performant and reliable data store. One experiment we’ve been looking into is how Solid State Drives (SSDs) affect the performance of Voldemort and, perhaps even more importantly, how that performance compares relative to the cost of the hardware.

Before even starting, every indication pointed to SSDs providing a significant performance boost in almost every type of benchmark, but we are solely concerned with how an SSD performs in our production infrastructure. I realize that’s a huge caveat, and I suppose there are plenty of artifacts that affect the performance numbers (purists shudder), but there are good reasons to use our production systems. First, I don’t have to take any machines out of rotation (unused resources are costly). Second, I don’t need to do any work to set up the benchmark; I just need to instrument already running services. Third, and probably most importantly, I get performance numbers on actual data(!), not just a contrived benchmark. If I plug more SSDs into my infrastructure, I know exactly how they will perform on the data that is most unique and special to me.

Setting up the benchmark was fairly easy. Since the source for Voldemort is available on GitHub, I just cloned the latest version and added some stopwatches using Perf4J. I was primarily concerned with the round-trip times for the standard operations ‘get’, ‘put’, and ‘getAll’ as seen by my clients. (As an aside, Voldemort makes server-side stats available via JMX if that’s your interest.) A Log4J Appender then forwards the stats collected by Perf4J to a central hub. I wrote a nice little parser in Scala (woot!) and generated some charts using JFreeChart.
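To give a feel for the parsing step, here is a minimal sketch in Python (not the Scala parser mentioned above). It assumes each forwarded record looks like Perf4J's default stopwatch output, `start[...] time[...] tag[...]`; the tag names and timestamps in the sample are made up for illustration, and the 10-minute bucket size is chosen to match the chart resolution.

```python
import re
from collections import defaultdict

# Assumed record shape, modeled on Perf4J's default stopwatch output:
#   start[<epoch millis>] time[<elapsed millis>] tag[<name>]
LINE = re.compile(r"start\[(\d+)\] time\[(\d+)\] tag\[([^\]]+)\]")

BUCKET_MS = 10 * 60 * 1000  # 10-minute buckets


def bucket_averages(lines):
    """Return {(tag, bucket_start_ms): (average_ms, call_count)}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of elapsed ms, count]
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue  # skip anything that is not a timing record
        start_ms, elapsed_ms, tag = int(match[1]), int(match[2]), match[3]
        bucket = start_ms - start_ms % BUCKET_MS  # floor to the bucket start
        entry = totals[(tag, bucket)]
        entry[0] += elapsed_ms
        entry[1] += 1
    return {key: (total / count, count) for key, (total, count) in totals.items()}


# Hypothetical sample records; the tag names are illustrative only.
sample = [
    "start[1294668000000] time[2] tag[ticker.get]",
    "start[1294668000500] time[4] tag[ticker.get]",
    "start[1294668600000] time[3] tag[ticker.put]",
    "not a timing line",
]
averages = bucket_averages(sample)
```

Keeping the call count next to each average matters later, because it lets the buckets be combined without giving quiet intervals the same weight as busy ones.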

It also makes sense to talk about the hardware used in the experiment. In general, the machines run similar Linux installations and mostly similar services. They’re all 32-bit, dual-core machines with 4GB of memory. The biggest difference is that half the machines are configured with the Physical Address Extension (PAE) feature while the other half are not. Machines with PAE enabled are capable of addressing the full 4GB of memory, while non-PAE machines reserve ~1GB for the kernel, leaving about 3GB.

Machine    Memory     CPU
#14 (SSD)  3GB        Intel E2140 1.60GHz
#1         4GB (PAE)  Intel E2180 2.00GHz
#6         3GB        Intel E2140 1.60GHz
#8         3GB        Intel E2140 1.60GHz
#29        4GB (PAE)  Intel E2160 1.80GHz
#30        4GB (PAE)  Intel E2160 1.80GHz

So, with no further delay, here are the charts and data! In the charts, the left axis is the average time in milliseconds between when an operation starts and stops, as measured from the client. The data spans 5 full stock market trading days. The chart data was captured in 10-minute increments, while the tables show the data rolled up into averages for each entire day. Each table cell lists the daily average followed by the number of times the operation was called. The SSD machine is drawn as a red line so it’s clearly distinguishable from the others.
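One way to roll 10-minute samples up into a daily figure is a call-weighted average, sketched below in Python. This is an assumption about how such a rollup could be done, not a description of the code behind the tables; the function name and the sample numbers are hypothetical.

```python
def roll_up(buckets):
    """Combine (average_ms, call_count) buckets into one (average_ms, call_count)."""
    total_calls = sum(count for _, count in buckets)
    if total_calls == 0:
        return (0.0, 0)  # nothing was called, so there is no meaningful average
    # Weight each bucket's average by its call count to recover total elapsed time.
    total_ms = sum(avg * count for avg, count in buckets)
    return (total_ms / total_calls, total_calls)


# Hypothetical buckets: a quiet interval, a busy one, and an idle one.
buckets = [(2.0, 100), (4.0, 300), (3.0, 0)]
daily = roll_up(buckets)
```

Weighting by call count keeps a busy interval from being averaged on equal footing with a nearly idle one. Here the busy 4.0 ms interval dominates, so the result is 3.5 ms over 400 calls rather than the plain mean of the bucket averages.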

This first set of charts shows stock ticker data that we store in Voldemort. We fetch stock information from our provider and put the data into Voldemort as protocol buffers, essentially using it as a persistent cache. As you can see in the chart, the process starts ramping up around 6am EST and is consistent throughout the day until right before the stock market closes at 4pm. Each object is roughly 190 bytes.

Stock Ticker / Get

Machine    01/10               01/11               01/12               01/13               01/14
#14 (SSD)  2.10 ms / 12579905  1.51 ms / 12808948  2.00 ms / 12372251  2.10 ms / 12775541  2.10 ms / 13021540
#1         4.50 ms / 13639223  2.86 ms / 14066013  4.50 ms / 14035117  4.30 ms / 14014719  4.30 ms / 14381169
#6         3.30 ms / 14697628  2.47 ms / 14974348  3.50 ms / 14976403  3.20 ms / 14992560  3.30 ms / 15297108
#8         3.30 ms / 14268500  2.41 ms / 14541938  3.40 ms / 14543452  3.10 ms / 14537675  3.20 ms / 14933763
#29        3.80 ms / 15638044  2.58 ms / 15978472  4.00 ms / 16044711  3.20 ms / 15962932  3.70 ms / 16407025
#30        3.10 ms / 13613129  2.30 ms / 13844155  3.30 ms / 13930480  3.10 ms / 13768293  3.30 ms / 14126112

Stock Ticker / Put

Machine    01/10               01/11               01/12               01/13               01/14
#14 (SSD)  2.20 ms / 12191506  1.57 ms / 12458999  2.10 ms / 12328035  2.20 ms / 12433188  2.20 ms / 12712775
#1         4.70 ms / 13203697  2.92 ms / 13646533  4.60 ms / 13620714  4.50 ms / 13596018  4.40 ms / 13952884
#6         3.50 ms / 14313209  2.53 ms / 14611432  3.70 ms / 14576707  3.30 ms / 14590833  3.50 ms / 14941805
#8         3.40 ms / 13969025  2.47 ms / 14268355  3.50 ms / 14189613  3.20 ms / 14229597  3.40 ms / 14575123
#29        4.20 ms / 15652314  2.75 ms / 15966916  4.30 ms / 15929493  3.40 ms / 15952355  4.00 ms / 16369449
#30        3.60 ms / 14872750  2.58 ms / 15160371  3.80 ms / 15213295  3.60 ms / 15075920  3.90 ms / 15508978

Stock Ticker / GetAll

Machine    01/10               01/11               01/12               01/13               01/14
#14 (SSD)  1.40 ms / 160011    2.79 ms / 165053    2.10 ms / 179204    1.20 ms / 230847    2.50 ms / 284638
#1         9.20 ms / 168622    4.37 ms / 175207    5.80 ms / 190591    5.20 ms / 252607    5.80 ms / 310047
#6         2.70 ms / 165798    4.37 ms / 171471    4.80 ms / 185357    2.50 ms / 237498    5.10 ms / 293108
#8         2.50 ms / 159300    4.50 ms / 164113    4.90 ms / 177945    2.20 ms / 229931    3.90 ms / 286053
#29        3.30 ms / 169287    3.51 ms / 174495    3.70 ms / 188762    3.30 ms / 251796    4.90 ms / 309301
#30        10.50 ms / 164491   9.78 ms / 170083    11.70 ms / 183780   10.90 ms / 240214   11.70 ms / 294018

The next data set is generated by a batch processing job that calculates portfolio performance from the day’s market data. It’s scheduled to start after the market closes and runs for a few hours to completion. Each result is stored as a protobuf of roughly 500 bytes. The daily values in the table are a little skewed (read: messed up) because the daily snapshot is taken at 16:00 EST, which is right around when the calculation runs, and it seems to miss some of the data. The charts are unaffected, since they’re sampled every 10 minutes.

Data Crunching / Get

Machine    01/10               01/11               01/12               01/13               01/14
#14 (SSD)  6.40 ms / 14502     13.13 ms / 249326   8.70 ms / 258187    11.90 ms / 13453    12.30 ms / 302971
#1         10.90 ms / 17765    15.31 ms / 270925   13.30 ms / 279968   20.30 ms / 14883    20.80 ms / 328361
#6         11.60 ms / 16839    13.60 ms / 288287   11.30 ms / 298032   15.90 ms / 15487    18.00 ms / 349492
#8         11.30 ms / 17217    13.68 ms / 281386   11.20 ms / 290325   19.40 ms / 15535    20.00 ms / 340501
#29        5.60 ms / 19824     12.15 ms / 304663   9.50 ms / 314920    17.90 ms / 16477    16.80 ms / 368730
#30        10.10 ms / 18200    14.88 ms / 268220   12.70 ms / 276024   18.80 ms / 15299    22.60 ms / 321462

Data Crunching / Put

Machine    01/10               01/11               01/12               01/13               01/14
#14 (SSD)  10.00 ms / 5530     12.79 ms / 177981   8.30 ms / 183660    15.00 ms / 7435     12.20 ms / 216806
#1         14.30 ms / 6575     14.91 ms / 196125   12.70 ms / 202345   25.00 ms / 8225     20.80 ms / 238778
#6         12.90 ms / 6746     12.69 ms / 201875   10.10 ms / 208506   19.00 ms / 8536     17.10 ms / 245850
#8         15.20 ms / 7035     12.68 ms / 207179   10.10 ms / 213601   26.50 ms / 8778     18.40 ms / 251763
#29        9.30 ms / 7321      11.34 ms / 216955   8.90 ms / 224183    21.40 ms / 9135     16.40 ms / 264182
#30        11.70 ms / 7260     13.77 ms / 215958   11.50 ms / 222130   27.40 ms / 9115     21.30 ms / 260285

Data Crunching / GetAll

Machine    01/10               01/11               01/12               01/13               01/14
#14 (SSD)  0.60 ms / 3682      1.00 ms / 4153      0.70 ms / 4989      0.50 ms / 2739      0.70 ms / 6381
#1         10.40 ms / 3872     6.62 ms / 4326      8.70 ms / 5077      8.80 ms / 2776      7.00 ms / 5278
#6         8.10 ms / 4031      8.64 ms / 4617      10.70 ms / 5642     9.20 ms / 3052      8.60 ms / 6120
#8         7.20 ms / 4470      6.97 ms / 5239      9.70 ms / 5784      8.60 ms / 3424      7.40 ms / 6473
#29        3.30 ms / 4423      5.13 ms / 5077      6.20 ms / 5687      6.60 ms / 3428      5.50 ms / 5923
#30        14.80 ms / 4103     12.08 ms / 5114     20.10 ms / 5474     21.20 ms / 3219     17.40 ms / 6352

This last set of charts covers another portfolio analytics calculation that runs as a batch processing job and uses Voldemort as a persistent intermediary. It’s also stored as a protobuf, and the object size is about 8-9 KB. Of the three stores, this one is the least interesting, as it’s only used for a short period each day and is untouched the rest of the time. I believe the sample size is somewhat small, but of all the experiments, this is the data set on which the SSD performs the worst. Also, due to the timing of the daily cutoff and the low number of calls, I’m not going to bother including the tables.

Gains / Get

Gains / Put

Gains / GetAll

Some Conclusions

Clearly, the solid state drive performs better than our other stores backed by traditional drives. One other interesting thing we notice in the data is that our machines without the Physical Address Extension seem to outperform the machines with it. Part of the reason may be that PAE adds an additional level of indirection to memory operations, but I’d be interested in whatever thoughts others have on the topic. It’s a little tough to see in the charts (since I was trying to highlight the SSD), but hopefully I’ll emphasize that difference in a post down the line. Also, since collecting the data for this experiment, we’ve added a 64-bit machine into the rotation, so it should be interesting to see what kind of results we get from that machine.
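To put rough numbers on both observations, here is a small Python sketch using the 01/10 averages from the Stock Ticker / Get table. It is a back-of-the-envelope comparison for a single day and a single operation, using an unweighted mean across machines, so treat it as illustrative rather than as a statistical result. The variable and function names are my own.

```python
from statistics import mean

# Average 'get' latency (ms) per machine on 01/10, copied from the
# Stock Ticker / Get table.
ticker_get_0110 = {
    "#14": 2.10, "#1": 4.50, "#6": 3.30,
    "#8": 3.30, "#29": 3.80, "#30": 3.10,
}
SSD = "#14"
PAE = {"#1", "#29", "#30"}  # machines listed as "4GB (PAE)" in the hardware table


def group_mean(latencies, machines):
    """Unweighted mean latency over the given machines."""
    return mean(latencies[m] for m in machines)


spinning = set(ticker_get_0110) - {SSD}                    # traditional-drive machines
hdd_avg = group_mean(ticker_get_0110, spinning)            # about 3.6 ms
ssd_speedup = hdd_avg / ticker_get_0110[SSD]               # roughly 1.7x
pae_avg = group_mean(ticker_get_0110, PAE)                 # about 3.8 ms
non_pae_avg = group_mean(ticker_get_0110, spinning - PAE)  # about 3.3 ms
```

On this one slice of data, the SSD machine is roughly 1.7 times faster than the average traditional-drive machine, and the non-PAE traditional-drive machines come in about half a millisecond ahead of the PAE ones. Other days and operations in the tables give different ratios.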

Finally, special thanks to Andrew Schwabecher and Will over at Central Host for helping us out! Also, for anyone who’s interested in seeing what changes I made to Voldemort to perform the benchmarking, check out my branch of Voldemort. (Try to ignore the hack I have in there for passing the configuration in to Voldemort instead of reading it from disk; it’s unrelated to the profiling.)