The kind of system we develop at kaChing is very data-intensive. Given our test-driven approach, many of the problems we encounter in production are caused by issues in data rather than bugs in code.
In particular, the system often gets confused by stock symbols. Wait a minute, you might wonder: how can stock symbols be confusing? GOOG is GOOG, and AAPL is AAPL. How could one possibly confuse GOOG with AAPL? Well, the stock market is actually much bigger and more confusing than just GOOG and AAPL. For example, if you bought 1000 shares of Waste Management on 2009/03/16 and held them until today, your brokerage statement probably shows a trade record of “2009/03/16 Buy WM 1000 shares”. This record is inherently confusing because WM refers to Waste Management today, but referred to Washington Mutual on the trade date 2009/03/16. Fed this trade record, the system has difficulty mapping the symbol to the right stock. Although standardized, non-recycled identifier systems such as ISIN or CUSIP make things easier, we still need to recognize tens of thousands of stocks solely based on symbols. Overloaded symbols like WM, L or STAR were reused by different companies over the years and are often problematic.
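One way to think about the WM problem is that a symbol is only meaningful together with a date. The following sketch shows a point-in-time symbol lookup; the data structure, function name, and all dates are illustrative assumptions, not kaChing's actual schema or the real listing dates.

```python
from datetime import date

# Illustrative symbol history: symbol -> list of
# (start_date, end_date, instrument_name). The dates below are
# made up for the example; only the reuse of "WM" is the point.
SYMBOL_HISTORY = {
    "WM": [
        (date(1983, 1, 1), date(2009, 6, 30), "Washington Mutual, Inc."),
        (date(2009, 7, 1), date(9999, 12, 31), "Waste Management, Inc."),
    ],
}

def resolve(symbol, as_of):
    """Return the instrument a symbol referred to on a given date, or None."""
    for start, end, name in SYMBOL_HISTORY.get(symbol, []):
        if start <= as_of <= end:
            return name
    return None
```

With this structure, the same symbol resolves differently depending on the trade date, which is exactly what a statement line like “2009/03/16 Buy WM” requires.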
Given the nature of the application, we strive to construct complete, reliable and accurate instrument reference data so that we can correctly identify and manage instruments. Instruments are financial contracts traded on exchanges or OTC markets. They can be common stocks, ETFs, ADRs, futures, options, etc. Instrument reference data consist of identifier information (such as symbol, name, CUSIP, ISIN, etc.) and other descriptive data (such as trading venue, industry, country, etc.). Faulty data pose a big operational risk because we might misidentify instruments in trading and portfolio valuation. Thus, it is crucial that we validate the data sourced from market data vendors before we deploy them.
For instrument reference data validation, we look at the problem from two dimensions: scope and time. The scope dimension could be local (i.e. per-instrument) or global (the entire instrument universe). The time dimension refers to either current or historical. The two dimensions produce the following four quadrants:
                                        Current    Historical
  Local (i.e. per instrument)             (1)         (2)
  Global (i.e. entire instrument
  universe)                               (3)         (4)
From quadrant (1) to (4), the validations become increasingly complex and computationally demanding:
(1) Local & Current validation examples: an instrument should have the correct trading symbol as of today; an instrument should be flagged as not-being-traded as of today if it was delisted already.
(2) Local & Historical validation examples: an instrument should have a symbol, ISIN, CUSIP and other identifier information on every date after its IPO date; its ISIN and CUSIP should also be valid on every date.
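A local & historical check of this kind amounts to verifying that an instrument's dated identifier records cover its whole life with no gaps. Here is a minimal sketch; the record shape, field names, and the `covers_history` helper are all assumptions made for illustration.

```python
from datetime import date, timedelta

def covers_history(ipo_date, records, as_of):
    """Check that identifier records cover every date from IPO to as_of.

    records: list of (start_date, end_date, identifiers) tuples, where
    identifiers is a dict with "symbol", "isin" and "cusip" keys.
    Returns False on any gap, overlap, or missing identifier.
    """
    expected_start = ipo_date
    for start, end, identifiers in sorted(records):
        if start != expected_start:
            return False  # gap or overlap before this record
        if not all(identifiers.get(k) for k in ("symbol", "isin", "cusip")):
            return False  # some identifier missing for this date range
        expected_start = end + timedelta(days=1)
    return expected_start > as_of  # coverage must reach the check date
```

Because the check walks one instrument's records in isolation, it stays cheap even when run over the whole universe, one instrument at a time.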
(3) Global & Current validation examples: No symbol/ISIN/CUSIP should be associated with more than one instrument as of today.
(4) Global & Historical validation examples: no symbol/ISIN/CUSIP should be associated with more than one instrument on any date in history.
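The global & historical check in quadrant (4) can be sketched as a sweep over dated identifier assignments: group by identifier, sort each group's date ranges, and flag overlaps between different instruments. The input tuple shape and function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

def find_conflicts(assignments):
    """Find identifiers assigned to two instruments on overlapping dates.

    assignments: iterable of (instrument_id, identifier, start, end).
    Returns a list of (identifier, instrument_a, instrument_b) conflicts.
    """
    by_identifier = defaultdict(list)
    for instrument_id, identifier, start, end in assignments:
        by_identifier[identifier].append((start, end, instrument_id))
    conflicts = []
    for identifier, spans in by_identifier.items():
        spans.sort()  # adjacent spans suffice to detect any overlap
        for (s1, e1, a), (s2, e2, b) in zip(spans, spans[1:]):
            if s2 <= e1 and a != b:  # ranges overlap, different instruments
                conflicts.append((identifier, a, b))
    return conflicts
```

Note that legitimate symbol reuse, like WM passing from Washington Mutual to Waste Management, produces non-overlapping ranges and is not flagged; only simultaneous double assignment is.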
As financial data flow from issuing companies/exchanges -> data aggregators/distributors -> end users, mistakes and errors are inevitably introduced along the data supply chain. Validating the data on the user end, by reconciling multiple data sources and correcting mistakes, takes an enormous amount of effort. At kaChing we spend a lot of time developing methods and tools for data validation, and we always validate the data before we deploy them, to make sure the trading system and portfolio analytics system are fed with accurate data.
To draw an analogy between code testing and data validation: just as we test extensively before we deploy code, we validate extensively before we deploy data.