How do you quickly move over 160 TB of data between AWS and physical data centers? Is it a good idea to store database backups in the cloud? Today we’re going to explore how the Wealthfront DevOps team was able to reduce a multi-day database restore process to a handful of hours using open source tooling.
How Bare Metal + AWS Gets Us Fast Backups but Slow Restores
Wealthfront runs our production workloads on bare-metal servers hosted in colocated data centers. We leverage AWS to augment our data centers and give us flexibility where needed. An excellent example of this “augmented bare-metal” approach is our database infrastructure.
For each database cluster we run, we replicate to a backup replica in AWS and take EBS snapshots of that replica to get nearly instant point-in-time backups of our dataset. To restore a backup to a replica in the colo, we must copy the snapshotted dataset from AWS onto a freshly prepared host back in the colo. All colo <-> AWS traffic happens over a 10 Gbps Direct Connect.
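For illustration, the snapshot side of this can be as simple as a single AWS CLI call against the replica’s data volume. The volume ID below is a placeholder, and a snapshot of a live volume is only crash-consistent, so in practice the replica would be quiesced or left to rely on InnoDB crash recovery:

# Hypothetical example: snapshot the backup replica's data volume.
# vol-0123456789abcdef0 is a placeholder, not a real volume ID.
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "mysql backup replica $(date +%F)"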

We knew the restoration process would be slow, since we would have to copy the dataset back into the colo during a restore, but this new backup method was still days faster than our previous strategy built on mysqldump.
Unfortunately, that slowness came back with a vengeance when, as part of a project to upgrade all database hosts, we had to recreate every instance in our colo. On average each host holds 9 TB of data, so we were looking at hundreds of terabytes to copy from AWS to our colos to spin up a replacement for every host.
To make things even more interesting, we decided to target a 30-day timeline in order to align the final cutover with a previously planned downtime window. But we were confident the team could pull it off!
Rsync: You Need To Start Somewhere
Our very first take at moving the data was to use trusty ol’ rsync. How bad could it be? Well, pretty terrible. In these initial tests, a basic

rsync -av /var/lib/mysql/ destination:/var/lib/mysql/

ran for a day and a half before we cut it off. Clearly this was not going to be sufficient!

A basic rsync runs at around 1 Gbps, much less than our target of ~10 Gbps.
Some quick math showed we’d be spending most of our 30 days waiting for bits to fly over the wire. If we were able to leverage our entire 10 Gbps Direct Connect we should be able to get the total transfer time down to a day or two.
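For reference, the back-of-the-envelope numbers (assuming roughly the full 160 TB has to cross the wire):

160 TB ≈ 1.28 × 10^15 bits
1.28 × 10^15 bits ÷ 1 Gbps ≈ 1.28 × 10^6 s ≈ 15 days
1.28 × 10^15 bits ÷ 10 Gbps ≈ 1.28 × 10^5 s ≈ 1.5 days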
Testing rsync Flags To Reduce Encryption Bottlenecks
This blog post from kmg.group detailed how they were able to speed up rsync by turning off compression and changing the encryption algorithm to one natively supported by the CPU.
Instead of a plain rsync -avh, you also specify the SSH command for rsync to use, focusing on an encryption algorithm the CPU supports natively. This changes the rsync command to something like the following (a reconstruction of that post’s approach; the right cipher depends on what your CPU accelerates):
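# Assumed reconstruction: disable compression and pick an AES-GCM
# cipher that AES-NI accelerates; -T and -x trim SSH session overhead.
rsync -avh -e "ssh -T -c aes128-gcm@openssh.com -o Compression=no -x" /var/lib/mysql/ destination:/var/lib/mysql/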

The graph of network throughput with “fast” rsync is basically identical to that of a normal rsync.
This was an easy hypothesis to test, so we gave it a shot and did not see any improvement in throughput. We went back to the drawing board to come up with a new plan. Talking amongst the team, we discovered that our packets’ round-trip time (RTT) was spiking to 60 milliseconds once they crossed into the AWS network. Because a TCP stream’s throughput is inversely proportional to RTT and we have no control over the latency introduced by AWS’s internal network, we realized we could not increase the throughput of a single TCP stream. Instead, we chose to focus on scaling our tooling horizontally so we could push more data through.
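To see why a single stream tops out, consider the bandwidth-delay product: a TCP connection can only have one window of unacknowledged data in flight per round trip, so its throughput is bounded by roughly window_size / RTT. Saturating 10 Gbps over a 60 ms RTT would need

10 Gbps × 0.060 s = 600 Mb ≈ 75 MB

of data in flight on a single connection, far more than typical TCP buffer tuning allows. Many parallel streams sidestep the limit by splitting that requirement across connections.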
Parallelizing rsync to Increase Network Throughput
Rather than trying to optimize a single rsync process (a single TCP connection), we decided to run many rsync jobs in parallel. As a first attempt, we generated a list of all the files we needed to copy, divided them into groups, then started an rsync process for each group.
This looks something like:
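Here is a minimal sketch of that first attempt, assuming GNU coreutils on the destination host; the hostname and the choice of 8 groups are illustrative, not our exact tooling:

# Build the file list on the AWS replica, split it into 8 groups
# without splitting lines, then launch one rsync per group.
ssh source-host "cd /var/lib/mysql && find . -type f" > all_files
split -n l/8 all_files group_
for group in group_*; do
  rsync -a --files-from="$group" source-host:/var/lib/mysql/ /var/lib/mysql/ &
done
wait  # all groups must finish before the replica is usable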
As a proof of concept this worked: we were able to get much higher peak throughput!

The initial high spike is when all the rsync processes are running. Once the processes with the smallest files finish, network utilization diminishes greatly.
In a perfect world, the files would all be the same size so dividing them between processes would result in each process copying the same amount of data. In practice, database files can vary greatly in size because a single InnoDB .ibd file maps to a single database table.
Practically speaking, this meant the files were not evenly divided between rsync processes, so some processes finished much sooner than others. This led to a long tail of waiting for that final rsync process to finish up. To mitigate this, we sorted the files by size before grouping them. That gave us a slightly more even distribution of utilization and got us to our first working MVP.
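A sketch of the size-aware grouping, under the same illustrative assumptions (and assuming no whitespace in file names): sorting largest-first and dealing files out round-robin gives each group a similar mix of large and small files.

# Emit "size path" pairs, sort by size descending, then assign
# files to group_0 .. group_7 in round-robin order.
ssh source-host "cd /var/lib/mysql && find . -type f -printf '%s %p\n'" \
  | sort -rn \
  | awk '{ print $2 > ("group_" NR % 8) }'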

The initial spike is slightly more sustained, but we are still stuck with a long tail of large files.
Unfortunately, it still took at least 8 hours to copy a single instance. That is “good enough” for a one-off job we can kick off overnight, but unsustainable long-term.
With so many instances to copy, we continued to look for ways to speed up the data copy. The shape of our data, a few giant files and many tiny ones, meant each transfer would always take at least as long as its largest file. If we could find a tool that broke files into chunks before sending them across parallel connections, we would get a much higher average throughput.
MSCP: Multi-Threaded SCP
While searching for options to multi-thread file copies we came upon mscp, a parallelized SCP implementation. The mscp tool distinguishes itself from other tools in several ways:
- It handles file chunking so large files can be parallelized
- It uses parallelized TCP streams via many SSH connections to let you maximize throughput
- It supports checkpointing so that an interrupted transfer can be resumed rather than restarted
These three features were exactly what we were looking for! We swapped out the parallel rsync commands for a single mscp invocation (sketched below). This not only simplified the code but also led to significant improvements in overall bandwidth. This is the kind of throughput we were expecting!
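A representative invocation rather than our exact command; -n sets the number of parallel SSH connections, and the value here is illustrative:

# One mscp replaces the fleet of rsync processes: it chunks large
# files and spreads the chunks across 16 SSH connections.
mscp -n 16 source-host:/var/lib/mysql/ /var/lib/mysql/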

Consistent utilization of the network link is a sign that we’re running about as optimally as possible.
With this approach we are now able to restore a new database replica from scratch in around 3 hours!
Takeaways and Future Work
Exploring solutions and iterating on them is celebrated within the Wealthfront Engineering organization. While we strive to discover roadblocks early and explore solutions, we constantly pursue automation that reduces the engineering toil that comes from significant manual setup. In light of that, we plan to integrate mscp into our automated database backup suite so that future work, such as a mock disaster recovery, benefits from these learnings.
In addition, this is a good example of how Wealthfront’s focus on proportional investment leads to a better engineering experience. We identified significant toil in the original data-movement process and realized that by investing time in solving the underlying problem, we could be free of it for good. This mindset let us take the time, within the overall database migration project, to find a sustainable long-term solution. Ultimately we carried that solution through to a successful migration and were ready for the database cutover with several days to spare!