Link to wealthfront.com

Fork me on GitHub

Tuesday, February 2, 2010

Voldemort in the Wild

At kaChing, we've tried to embraced as much of the lean startup methodology as possible. In keeping with the spirit, we've worked to scale our infrastructure smartly, using data to drive our decisions and discarding speculation. As part of our infrastructure, we've embraced Project Voldemort as a highly performant and reliable data store. One experiment we've been looking into is how the use of Solid State Drives may improve the performance of Voldemort, and perhaps even more importantly, how does that performance compare relative to the cost of the hardware. Before even starting, every indication pointed to SSD providing a significant performance boost in almost every type of benchmark, but we are solely concerned with how SSD performs in our production infrastructure. I realize that's a huge caveat, and I suppose there's plenty of artifacts that affect the performance numbers (purist shudder), but there are plenty of good reasons to use our production systems. First, I don't have to take any machines out of rotation (unused resources are costly). Second, I don't need to do any work to set up the benchmark, I just need to instrument already running services. Third, and probably most importantly, I get performance numbers on actual data(!), not just a contrived benchmark. If I plug in more SSD drives to my infrastructure, I know exactly how it will perform on the data that is most unique and special to me.

Setting up the benchmark was fairly easy. Since the source for Voldemort is available on GitHub, I just cloned the latest version and added some stopwatches using Perf4J. I was primarily concerned with the round-trip times for the standard operations 'get', 'put', 'getAll' from the view of my clients (As an aside, Voldemort makes server-side stats available via JMX if that's your interest). Then, we use a Log4J Appender to forward the stats collected from Perf4J to a central hub. I wrote a nice little parser in Scala (woot!) and generated some charts using JFreeChart.

It also makes sense to talk about the hardware used in the experiment. In general, the machines are similar Linux OSes, running mostly similar services. They're all 32-bit, dual-core processors with 4GB memory. The biggest difference is that half the machines are configured with the Physical Address Extension feature while the other are not. Machines with PAE enabled are capable of addressing the full 4GB of memory, while non-PAE machines reserve ~1GB for the kernel.
#14 (SSD)#1#6#8#29#30
3GB4GB (PAE)3GB3GB4GB (PAE)4GB (PAE)
Intel E2140 1.60GHzIntel E2180 2.00GHzIntel E2140 1.60GHzIntel E2140 1.60GHzIntel E2160 1.80GHzIntel E2160 1.80GHz

So, with no further delay, here are the charts and data! For the charts, the left side is the average time in milliseconds between when an operation starts and stops from the client. The timespan is over 5 full stock market trading days. The chart data was captured at 10 minute increments, while the tables show the data rolled up into averages for the entire day. In addition, the daily tables show the number of times the operation was called as the second number in the table cell. The SSD machine is highlighted as a red line so it's clearly distinguishable from the others.

This first set of charts shows stock ticker data that we store in Voldemort. We take fetch stock information from our provider and put the data as protocol buffers into Voldemort, essentially using it as a persistent cache. As you can see in the chart, the process starts ramping up around 6am EST, and is consistent throughout the day until right before the stock market closes at 4pm. The size of the data is roughly about 190 bytes per object.

Stock Ticker / Get


01/1001/1101/1201/1301/14
#14 (SSD)2.10 ms / 125799051.51 ms / 128089482.00 ms / 123722512.10 ms / 127755412.10 ms / 13021540
#14.50 ms / 136392232.86 ms / 140660134.50 ms / 140351174.30 ms / 140147194.30 ms / 14381169
#63.30 ms / 146976282.47 ms / 149743483.50 ms / 149764033.20 ms / 149925603.30 ms / 15297108
#83.30 ms / 142685002.41 ms / 145419383.40 ms / 145434523.10 ms / 145376753.20 ms / 14933763
#293.80 ms / 156380442.58 ms / 159784724.00 ms / 160447113.20 ms / 159629323.70 ms / 16407025
#303.10 ms / 136131292.30 ms / 138441553.30 ms / 139304803.10 ms / 137682933.30 ms / 14126112


Stock Ticker / Put


01/1001/1101/1201/1301/14
#14 (SSD)2.20 ms / 121915061.57 ms / 124589992.10 ms / 123280352.20 ms / 124331882.20 ms / 12712775
#14.70 ms / 132036972.92 ms / 136465334.60 ms / 136207144.50 ms / 135960184.40 ms / 13952884
#63.50 ms / 143132092.53 ms / 146114323.70 ms / 145767073.30 ms / 145908333.50 ms / 14941805
#83.40 ms / 139690252.47 ms / 142683553.50 ms / 141896133.20 ms / 142295973.40 ms / 14575123
#294.20 ms / 156523142.75 ms / 159669164.30 ms / 159294933.40 ms / 159523554.00 ms / 16369449
#303.60 ms / 148727502.58 ms / 151603713.80 ms / 152132953.60 ms / 150759203.90 ms / 15508978


Stock Ticker / GetAll


01/1001/1101/1201/1301/14
#14 (SSD)1.40 ms / 1600112.79 ms / 1650532.10 ms / 1792041.20 ms / 2308472.50 ms / 284638
#19.20 ms / 1686224.37 ms / 1752075.80 ms / 1905915.20 ms / 2526075.80 ms / 310047
#62.70 ms / 1657984.37 ms / 1714714.80 ms / 1853572.50 ms / 2374985.10 ms / 293108
#82.50 ms / 1593004.50 ms / 1641134.90 ms / 1779452.20 ms / 2299313.90 ms / 286053
#293.30 ms / 1692873.51 ms / 1744953.70 ms / 1887623.30 ms / 2517964.90 ms / 309301
#3010.50 ms / 1644919.78 ms / 17008311.70 ms / 18378010.90 ms / 24021411.70 ms / 294018


The next data set is generated by a batch processing job that calculates portfolio performance from the day's market data. It's scheduled to start after the market closes and runs for a few hours to completion. It's represented as a protobuf and the size of the data is roughly about 500 bytes per object. The daily values in the table are a little skewed (read: messed up) because the snapshot is taken at 16:00 EST, which is right near some of the calculation and seems to be miss some of the data. The charts are unaffected, since they're taken every 10 minutes.

Data Crunching / Get


01/1001/1101/1201/1301/14
#14 (SSD)6.40 ms / 1450213.13 ms / 2493268.70 ms / 25818711.90 ms / 1345312.30 ms / 302971
#110.90 ms / 1776515.31 ms / 27092513.30 ms / 27996820.30 ms / 1488320.80 ms / 328361
#611.60 ms / 1683913.60 ms / 28828711.30 ms / 29803215.90 ms / 1548718.00 ms / 349492
#811.30 ms / 1721713.68 ms / 28138611.20 ms / 29032519.40 ms / 1553520.00 ms / 340501
#295.60 ms / 1982412.15 ms / 3046639.50 ms / 31492017.90 ms / 1647716.80 ms / 368730
#3010.10 ms / 1820014.88 ms / 26822012.70 ms / 27602418.80 ms / 1529922.60 ms / 321462


Data Crunching / Put


01/1001/1101/1201/1301/14
#14 (SSD)10.00 ms / 553012.79 ms / 1779818.30 ms / 18366015.00 ms / 743512.20 ms / 216806
#114.30 ms / 657514.91 ms / 19612512.70 ms / 20234525.00 ms / 822520.80 ms / 238778
#612.90 ms / 674612.69 ms / 20187510.10 ms / 20850619.00 ms / 853617.10 ms / 245850
#815.20 ms / 703512.68 ms / 20717910.10 ms / 21360126.50 ms / 877818.40 ms / 251763
#299.30 ms / 732111.34 ms / 2169558.90 ms / 22418321.40 ms / 913516.40 ms / 264182
#3011.70 ms / 726013.77 ms / 21595811.50 ms / 22213027.40 ms / 911521.30 ms / 260285



Data Crunching / GetAll


01/1001/1101/1201/1301/14
#14 (SSD)0.60 ms / 36821.00 ms / 41530.70 ms / 49890.50 ms / 27390.70 ms / 6381
#110.40 ms / 38726.62 ms / 43268.70 ms / 50778.80 ms / 27767.00 ms / 5278
#68.10 ms / 40318.64 ms / 461710.70 ms / 56429.20 ms / 30528.60 ms / 6120
#87.20 ms / 44706.97 ms / 52399.70 ms / 57848.60 ms / 34247.40 ms / 6473
#293.30 ms / 44235.13 ms / 50776.20 ms / 56876.60 ms / 34285.50 ms / 5923
#3014.80 ms / 410312.08 ms / 511420.10 ms / 547421.20 ms / 321917.40 ms / 6352


This last set of charts is also a portfolio analytics calculation that runs as a batch processing job and uses voldemort as a persistent intermediary. It's also stored as a protobuf and the object size is about 8-9K per object. Of the three stores, this one is the least interesting, as it's only used for a short period of time during the day, and is untouched during the rest of the day. I also believe the sample size is somewhat small, but of all the experiments, the SSD performs the worst on this data set. Also, due to the timing of the daily cut off and the low number of calls, I'm not going to bother including the table.

Gains / Get



Gains / Put



Gains / GetAll



Some Conclusions


Clearly, the performance of the solid state drive is better than our our other stores backed by traditional drives. One other interesting thing we notice in the data is that our machines without the Physical Address Extension seem to outperform the machines with the extension. Some of the reasoning may be that PAE adds an additional level of indirection required for memory operations, but I'd be interested in whatever thoughts others have on the topic. It's a little tough to see from the charts, since I was trying to highlight the SSD), but hopefully in a post down the line, I'll emphasize that difference. Also, since we collected data for this experiment, we've also added a 64-bit machine into the rotation, so it should be interesting to see what kind of results we see from that machine.

Finally, special thanks to Andrew Schwabecher and Will over at Central Host for helping us out! Also, for anyone who's interested in seeing what changes I made to voldemort to perform the benchmarking, check out my branch of Voldmort. (Try to ignore the hack I have in there for using Voldemort with passing in the configuration instead of reading the config from disk; it's unrelated to the profiling)