Wealthfront’s trading system evaluates hundreds of thousands of client accounts daily for tax-loss harvesting and rebalancing opportunities, requiring high throughput to process all accounts within market hours and low latency to minimize price movement while making investment decisions.
In a system with heavy database usage and low-latency SLAs such as ours, database queries must be efficient and resilient against performance degradation over time. For instance, an O(n) query might appear fast when the table is small, but as the table grows, the query’s runtime will grow proportionally. When dealing with queries on single tables, performance can often be optimized simply by analyzing the columns used and adding an appropriate index. In contrast, optimizing complex joins across multiple tables can be challenging, since most database implementations don’t support multi-table indexes, and since some solutions, such as PostgreSQL’s materialized views, don’t update automatically as the underlying tables change. In this blog post, we’ll describe a slow multi-table SQL query encountered by the Wealthfront trading team, and explore two ideas that we considered for improving its performance.
We have two tables, orders and order_allocations. orders represents orders for our clients that we send to market, and order_allocations records each client’s portion of an order. A single order that is sent to the market may be on behalf of many clients, since many clients may wish to trade the same security.
For example, suppose we have two orders:
- Buy of 100 shares of VTI, with 40 for client ‘A’ and 60 for client ‘B’
- Sell of 50 shares of VEA, 50 for client ‘A’
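The post doesn’t reproduce the table definitions, but a minimal sketch of the two tables and the rows for these orders might look like the following (column names beyond those mentioned in the text are illustrative):

```sql
-- Hypothetical minimal schemas; the real tables likely have more columns.
CREATE TABLE orders (
  id           BIGINT PRIMARY KEY AUTO_INCREMENT,
  symbol       VARCHAR(8) NOT NULL,
  side         VARCHAR(4) NOT NULL,   -- 'BUY' or 'SELL'
  quantity     INT        NOT NULL,
  completed_at DATETIME
);

CREATE TABLE order_allocations (
  id        BIGINT PRIMARY KEY AUTO_INCREMENT,
  order_id  BIGINT      NOT NULL REFERENCES orders (id),
  client_id VARCHAR(16) NOT NULL,
  quantity  INT         NOT NULL
);

-- Order 1: buy 100 shares of VTI (40 for client 'A', 60 for client 'B')
INSERT INTO orders (id, symbol, side, quantity, completed_at)
VALUES (1, 'VTI', 'BUY', 100, NOW());
INSERT INTO order_allocations (order_id, client_id, quantity)
VALUES (1, 'A', 40), (1, 'B', 60);

-- Order 2: sell 50 shares of VEA, all for client 'A'
INSERT INTO orders (id, symbol, side, quantity, completed_at)
VALUES (2, 'VEA', 'SELL', 50, NOW());
INSERT INTO order_allocations (order_id, client_id, quantity)
VALUES (2, 'A', 50);
```

Each row in order_allocations points back at a single row in orders, giving the one-to-many relationship between the two tables.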
The Slow Query
Suppose we want to fetch all orders completed today for client ‘A’. The query would look like:
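The original query isn’t shown in the text; a plausible reconstruction, using only the columns the post names, is:

```sql
-- Fetch all orders completed today for client 'A' (sketch, not the exact
-- production query).
SELECT orders.*
FROM order_allocations
JOIN orders ON orders.id = order_allocations.order_id
WHERE order_allocations.client_id = 'A'
  AND orders.completed_at > DATE(NOW());
```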
The issue with this query is that it includes conditions on columns in two separate tables: order_allocations.client_id and orders.completed_at. Since a multi-column index cannot be created across two tables, the best the database can do is use a single index on order_allocations.client_id to narrow down the rows in order_allocations with client id ‘A’ before performing the join on orders. After the join, the database must then filter on orders.completed_at, which requires iterating through all of the joined rows.
The runtime of this query is O(n), where n is the number of orders the client has placed all-time[1]. This isn’t great for some of our older accounts that have tens of thousands of orders.
Improvement solution 1 – Storing and indexing on extra columns
One thing to notice is that orders and order_allocations have a one-to-many relationship. One way to improve the SQL query would be to create a completed_at column in order_allocations, which is a copy of the value of orders.completed_at. Then, we can simply add a multi-column index on order_allocations (client_id, completed_at) and make a small change to our SQL query:
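The modified query isn’t reproduced in the post; a sketch of the change, assuming the new column is populated whenever an order completes, might be:

```sql
-- Denormalize completed_at onto order_allocations (must be kept in sync
-- with orders.completed_at on every write).
ALTER TABLE order_allocations ADD COLUMN completed_at DATETIME;
CREATE INDEX idx_client_completed
  ON order_allocations (client_id, completed_at);

-- Both filter conditions now live on one table, so the multi-column
-- index covers the entire WHERE clause before the join.
SELECT orders.*
FROM order_allocations
JOIN orders ON orders.id = order_allocations.order_id
WHERE order_allocations.client_id = 'A'
  AND order_allocations.completed_at > DATE(NOW());
```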
* Remember that the order of the columns specified in the index determines the ordering of that index: an index on (client_id, completed_at) supports filtering by client_id alone or by both columns, but not by completed_at alone.
Improvement solution 2 – Using an intermediary query
Another solution uses an intermediary query to narrow down the order ids we need to evaluate.
First we query for an order id from yesterday:
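The intermediary query isn’t shown in the post; one plausible form, assuming orders completed yesterday can be identified by completed_at, is:

```sql
-- Fetch an order id from yesterday to serve as a lower bound.
-- The first order completed yesterday is a natural choice.
SELECT MIN(id)
FROM orders
WHERE completed_at >= DATE(NOW()) - INTERVAL 1 DAY
  AND completed_at <  DATE(NOW());
```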
Then we can create a multi-column index on order_allocations (client_id, order_id) and modify the SQL query with an additional constraint:
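A sketch of the index and the modified query, where :yesterday_order_id is the id returned by the intermediary query described above:

```sql
CREATE INDEX idx_client_order
  ON order_allocations (client_id, order_id);

-- The order_id lower bound lets the (client_id, order_id) index prune
-- old allocations before the join.
SELECT orders.*
FROM order_allocations
JOIN orders ON orders.id = order_allocations.order_id
WHERE order_allocations.client_id = 'A'
  AND order_allocations.order_id > :yesterday_order_id
  AND orders.completed_at > DATE(NOW());
```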
Note that this solution has two assumptions:
- A newer order always has a higher id than an older order (e.g. id is an auto_increment column)
- Orders always complete within the same day. So an order completed yesterday will always have a lower id than one completed today.
By using the additional constraint on order_allocations.order_id alongside order_allocations.client_id = 'A', the database can make use of the new index to filter down order_allocations to only orders completed today and yesterday, before executing the join and applying the orders.completed_at > date(now()) condition. The improved runtime is O(j), where j is the number of orders completed today or yesterday in the account[2].
One further improvement we can make is to cache yesterday’s id and re-use it across all similar queries, thus amortizing the cost of obtaining it.
Choosing between different solutions
As always, selecting the best solution among multiple possible options requires weighing the advantages and disadvantages of each.
In the first solution involving indexing on extra columns, data reads become simpler at the expense of storing more data. The SQL query remains almost unchanged from the original, and it doesn’t depend on the results of other queries or the use of caches. However, the drawback is that it involves additional write overhead to the order_allocations table, which could cause consumers to spend more time acquiring read locks if the orders table is frequently updated. Furthermore, maintaining redundant data and ensuring that it stays in sync with the original data is an additional challenge; out-of-sync data can cause incorrect query results and confusion as to which source is correct.
The second solution with an intermediary query is the opposite: reads get more complex while writes stay the same. This approach also carries the risk of getting your assumptions wrong and erroneously filtering out relevant rows. For instance, we might introduce a new order type that completes over multiple days. This solution would handle it incorrectly, since we would no longer have a guarantee that an incomplete order always has a larger order id than an order from yesterday.
In our case with our orders query, we opted for the second solution with an intermediary query because:
- We place a large number of orders quickly during market hours, so we need to keep inserts as lightweight as possible.
- Keeping database rows immutable means we don’t have to obtain write locks on specific rows, which reduces the possibility of lock contention between producers and consumers.
- No redundant data has to be maintained, which is often a source of toil when things get out of sync.
By optimizing this one query, we saved 30 hours of CPU time per day! If problems like these interest you, be sure to stop by our careers page and join us in bettering the financial lives of hundreds of thousands of people.
[1] Technically O(log(m) + n*log(k)) to account for index lookup time, where m is the total size of order_allocations and k is the total size of orders.
[2] Technically O(log(m) + j*log(k)).
© 2023 Wealthfront Corporation. All rights reserved.