Automating Data Quality Checks on External Data

Written by Emily Shiue and Melissa Chen

The code was simple: query the third party API, and if the “ModifiedAt” date it returned was more recent than ours, send a notification to our clients. We had all the unit tests, integration tests, and monitoring in place, and it had been running smoothly in production for months. So how did we end up sending our clients a notification that their statements were updated when they weren’t?

At Wealthfront, we rely on third party data for some of our most important business functions, such as retrieving our clients’ linked bank account transactions or generating tax documents. While we generally trust these providers’ services, we need to ensure our systems are designed to be resilient to any data inconsistencies.

Over the years, we’ve developed a methodology for validating and monitoring third party data against our expectations. We will go over this methodology, how it can be applied, and the trade-offs to consider when using it.

Methodology

At Wealthfront, we build systems with an emphasis on automation. This is fundamental to Wealthfront’s success, but it means that external parties can cause disruptions to our systems if their data is inconsistent with what we expect. 

Generally our data quality check methodology helps when:

  • Data is not in the format we expect
  • Data comes from a manual entry process
  • The content of the data is illogical or easily misinterpreted
  • Data is entirely missing, or 
  • The meaning of the data we receive has changed

Our methodology consists of three steps:

Persist the raw data

Files, API responses, and any other consumable data we receive are stored in our systems exactly as they arrive. At this point, no validation or cleansing has been applied; persisting the raw data lets us debug issues later and determine whether a problem was a result of the data we received or something internal, like our parsers or processors.
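
As an example, a payload we receive for a deposit might look like the following sketch (all field names and values are hypothetical; note the suspicious GrossAmount, which we will return to later):

    {
      "TransactionId": "a1b2c3",
      "TransactionType": "DEPOSIT",
      "Quantity": "10",
      "SharePrice": "13.10",
      "GrossAmount": "13100.00",
      "ModifiedAt": "2021-05-01T09:30:00Z",
      "VendorBatchId": "20210501-044",
      "VendorInternalStatus": "PROCESSED"
    }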


Transformation and validation

Oftentimes, our third parties include additional fields that are not relevant to our use case, so after persisting the raw data, we transform it to an internal schema and run validation on that. This allows us to decouple the vendor-specific schema from our internal assumptions about the data model. Two classes of errors are detected in this step: unexpected data schemas and errors in business logic processing.
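
Looking at the example payload above, the transformed result might look something like this sketch (the Deposit record and its field names are illustrative):

    import java.math.BigDecimal;
    import java.time.Instant;

    // Illustrative internal schema; vendor-only fields such as
    // VendorBatchId are dropped during transformation.
    public record Deposit(
        String transactionId,
        BigDecimal quantity,
        BigDecimal sharePrice,
        BigDecimal grossAmount,
        Instant modifiedAt) {}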

We have found that the best way to validate data is to list the invariants we expect of the third party’s data and codify them into assertions. With this Deposit object, we could assert, for example, that every required field is present, that the quantity and share price are positive, and that the gross amount is consistent with quantity x share price.
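
A sketch of those assertions (the exact checks are illustrative and depend on the vendor’s contract):

    import java.math.BigDecimal;

    // Codified invariants for the illustrative Deposit above.
    static void validate(Deposit d) {
      require(d.transactionId() != null && !d.transactionId().isEmpty(),
          "transaction id must be present");
      require(d.quantity() != null && d.quantity().signum() > 0,
          "quantity must be positive");
      require(d.sharePrice() != null && d.sharePrice().signum() > 0,
          "share price must be positive");
      require(d.modifiedAt() != null, "modified date must be present");
      // The gross amount should agree with quantity x share price.
      BigDecimal expected = d.quantity().multiply(d.sharePrice());
      require(d.grossAmount() != null
              && d.grossAmount().compareTo(expected) == 0,
          "gross amount must equal quantity x share price");
    }

    private static void require(boolean condition, String message) {
      if (!condition) {
        throw new IllegalArgumentException(message);
      }
    }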

If superfluous data was present, it would have been ignored when creating the Deposit object. If expected data is missing, this is the point at which we should alert or throw an exception so that we can go directly to the source of the error.


Alerting and handling on failed validation

When possible, we want to handle the error gracefully and be alerted only when manual intervention is required. If we can use other parts of the data to deduce the expected value, we often fall back to that approach. Otherwise, we escalate to our third party to fix the issue, and when there is nothing we can do on our end, we automate the escalation itself, sending emails or alerts to the third party so they can start working on it directly.

In the example payload, we can see that the gross amount (13,100.00) looks wildly off from what it should be, which is quantity x share price = 131.00. If we can determine that the gross amount is unreasonable, we may fall back to deriving the value from quantity x share price instead of using the gross amount directly.
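
A sketch of that fallback for the illustrative Deposit (the discrepancy logging shown here is hypothetical):

    import java.math.BigDecimal;

    // Prefer the reported gross amount, but fall back to the value derived
    // from quantity x share price when the reported one fails validation.
    static BigDecimal resolveGrossAmount(Deposit d) {
      BigDecimal derived = d.quantity().multiply(d.sharePrice());
      if (d.grossAmount() == null || d.grossAmount().compareTo(derived) != 0) {
        // Record the discrepancy so it can be escalated to the third party.
        System.err.printf("gross amount %s disagrees with derived %s for %s%n",
            d.grossAmount(), derived, d.transactionId());
        return derived;
      }
      return d.grossAmount();
    }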

Applying the Methodology

Now let’s put this into practice. Remember the scenario from the beginning? We relied on querying our third party partner’s API for the modification date to determine whether the data had been updated. Nothing in the documentation indicated that this date would be updated for any other reason.

However, we later found out that the “ModifiedAt” metadata field also reflected changes to fields internal to the vendor’s own object, so even though “ModifiedAt” was updated, the data we received from the API had not actually changed. Our code saw that the modified date had changed, assumed the data had changed as well, and notified our clients to expect updated data when nothing had actually changed. This led to client confusion and incoming inquiries to our client services team. Though the mistake here was harmless, it highlights how a larger issue with a similar cause could lead to a loss of trust with our clients. To prevent this error from recurring, we evaluated our implementation against our preferred methodology.

  1. Persist the raw data – Prior to this, we were persisting just the metadata and pulling the rest of the data directly from the API into our downstream processes. Though we had the capacity to store all the additional data, it didn’t seem necessary since we assumed we could fetch the data whenever we needed it. Realizing that persisting everything gave us visibility into how the data changed over time, we updated the flow to store the full API response in our systems.
  2. Transformation and validation – Since not all of the data the vendor returned was relevant, we extracted the fields we needed and ran validation on the new object. What mattered to us was whether the data had been updated so we could notify our clients. Instead of relying on the vendor’s fields to tell us whether the data was modified, we now had the fully persisted data and could compare clean representations of individual fields directly, as in the sketch after this list.
  3. Handling – With this extra level of verification, we could handle these situations gracefully with high confidence that the data was correct: we ignore updates that contain no relevant changes and notify clients only of updates that do. We also added reporting that documents the exact changes made, for review and documentation purposes.
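
A sketch of that comparison (the Deposit record and field set are illustrative; “ModifiedAt” is deliberately excluded, because a new timestamp alone does not mean client-visible data changed):

    // Compare the fields we actually care about against the previously
    // persisted snapshot instead of trusting the vendor's ModifiedAt.
    static boolean hasRelevantChanges(Deposit previous, Deposit latest) {
      return previous.quantity().compareTo(latest.quantity()) != 0
          || previous.sharePrice().compareTo(latest.sharePrice()) != 0
          || previous.grossAmount().compareTo(latest.grossAmount()) != 0;
    }

Clients are then notified only when hasRelevantChanges returns true.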

Trade-offs

As with any methodology, there are trade-offs to evaluate in deciding whether it works best for your system.

  • Storage: Retrieved data can be extremely large. Ensure you have the storage capacity to persist the required data.
  • Processing time: Following the series of steps may increase processing latency.
  • Legal and privacy reasons: Some client contracts require that certain data not be stored.

Conclusion

With automation at the forefront of our engineering culture, Wealthfront has developed many paradigms to increase the resiliency of our systems and aid in diagnosing issues. This methodology is widely used when consuming data from third parties, and it gives us confidence that inconsistencies in the data will not halt all of our processes. Although the paradigm has its trade-offs, it has worked well for us and we hope that you find it useful too.

If this work interests you or if you have thoughts on how we might make it even better, check out our careers page.

Disclosures

The information contained in this communication is provided for general informational purposes only, and should not be construed as investment or tax advice. Nothing in this communication should be construed as a solicitation or offer, or recommendation, to buy or sell any security. Any links provided to other server sites are offered as a matter of convenience and are not intended to imply that Wealthfront Advisers or its affiliates endorses, sponsors, promotes and/or is affiliated with the owners of or participants in those sites, or endorses any information contained on those sites, unless expressly stated otherwise.

All investing involves risk, including the possible loss of money you invest, and past performance does not guarantee future performance. Please see our Full Disclosure for important details.

Wealthfront offers a free software-based financial advice engine that delivers automated financial planning tools to help users achieve better outcomes. Investment management and advisory services are provided by Wealthfront Advisers LLC, an SEC registered investment adviser, and brokerage related products are provided by Wealthfront Brokerage LLC, a member of FINRA/SIPC.   

Wealthfront, Wealthfront Advisers and Wealthfront Brokerage are wholly owned subsidiaries of Wealthfront Corporation.

© 2021 Wealthfront Corporation. All rights reserved.