-Oliver Wendell Holmes, Sr., The Autocrat of the Breakfast Table
At Wealthfront, we manage over $2.5 billion in assets that our clients have entrusted to us. Trust and transparency are the foundations of a strong relationship between a financial institution and its clients. A lack of trust in an institution impedes its efficiency and innovation. At Wealthfront, we place a strong premium on establishing and maintaining trust with our clients.
One of the core ways in which we establish trust is by building robust and reliable software products for our clients. The essential component in ensuring high quality software is testing. Without testing, we cannot validate or have confidence in any software we build. This is how we, at Wealthfront, gain trust in our code.
Logic Validation vs. Data Validation
Software can be tested or validated at various levels using a variety of techniques. Two necessary forms of software testing are (1) logic validation and (2) data validation. Logic validation is the process of determining whether the applied logic (in the form of code) exhibits the desired behaviour. Data validation, by contrast, is the process of ensuring that the input data supplied to a system is valid and consistent with the system's intended data types. Many tools exist in the industry that facilitate and standardize practices for logic validation. The techniques for data validation, however, are particular to each use case.
At Wealthfront, we practice data validation in our continuous deployment model to ensure that each deployed service functions correctly with the new set of data. As described in the linked blog post, the technique employed is customized for the specific use case in question. Another scenario where we use data validation is verifying the contents of document images we receive from external partners. As these documents are images embedded in PDFs, we cannot use standard PDF parsing techniques to validate their content. Instead, we use a technique called optical character recognition (OCR).
Optical Character Recognition (OCR)
Optical character recognition (OCR) refers to the automated process of translating images of text into machine-encoded text, such as ASCII. It is widely used in commercial applications to store, edit, search and analyze text documents (typewritten or printed). OCR completes in a matter of seconds what would otherwise be a cumbersome manual task. It works by scanning your images, extracting the contained text, splitting the text into characters and then recognizing those characters. It can be trained to recognize a variety of different fonts, languages and even handwritten text. In the open source world, Tesseract is perhaps the leading and most accurate OCR engine. Originally developed as a PhD research project at Hewlett-Packard (HP) in the 1980s, Tesseract has been significantly enhanced by Google since it became open source. At Wealthfront, we use Tesseract to do OCR validation on scanned PDF documents.
Since Tesseract relies on the Leptonica image processing library to perform OCR, it only works with image files such as PNGs or TIFFs and cannot work with PDFs directly. It needs to be combined with a PDF interpreter such as Ghostscript, which interprets PostScript and PDF files and can convert them to image files. To perform OCR from Java code, you need a Java Native Access (JNA) wrapper that simplifies native library access to the Tesseract OCR engine. Tess4J is a JNA wrapper that combines the Tesseract DLLs with Ghostscript to support PDF documents.
Following is some sample Java code that takes a scanned PDF document, converts it into PNGs, and then performs OCR using Tess4J libraries:
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
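The flow described above (render each PDF page to a PNG, then run OCR on each image) can be sketched without tying it to a particular Tess4J version. In this sketch, `PdfRasterizer`, `OcrEngine` and `PdfOcrPipeline` are hypothetical names introduced for illustration and are not part of Tess4J; in a real program the two interfaces would presumably delegate to Tess4J's PDF conversion utility and its doOCR method.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over the PDF-to-PNG step (for example, backed by Ghostscript). */
interface PdfRasterizer {
    List<File> toPngs(File pdf);
}

/** Hypothetical abstraction over the OCR step (for example, backed by Tess4J's doOCR). */
interface OcrEngine {
    String recognize(File image);
}

/** Runs OCR over every page of a PDF, one rendered image at a time. */
final class PdfOcrPipeline {
    private final PdfRasterizer rasterizer;
    private final OcrEngine engine;

    PdfOcrPipeline(PdfRasterizer rasterizer, OcrEngine engine) {
        this.rasterizer = rasterizer;
        this.engine = engine;
    }

    /** Returns the recognized text of each page, in page order. */
    List<String> recognizePages(File pdf) {
        List<String> pages = new ArrayList<>();
        for (File png : rasterizer.toPngs(pdf)) {
            pages.add(engine.recognize(png));
        }
        return pages;
    }
}
```

Keeping the two steps behind interfaces makes it possible to time or replace the conversion step and the recognition step independently. Implementations would need to wrap any checked exceptions thrown by the underlying library.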
Although Tesseract's accuracy in converting images to text is sufficient and compares well to commercial options, its execution speed is slow. In sample runs, it takes roughly 8-10 seconds to perform OCR on a small PDF document (3-4 pages). The immediate culprit here isn't Tesseract, though; it's Ghostscript. Tess4J's pdfUtilities internally uses Ghostscript to convert a PDF file to a set of PNG images.
Ghostscript Performance Enhancements
There are settings that can be tuned to increase the performance of Ghostscript. If you use the default convertPdf2Png method in Tess4J's pdfUtilities, you cannot apply custom settings. However, you can always write your own wrapper for Ghostscript and calibrate its settings to optimize the performance of your program.
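A minimal sketch of such a wrapper might invoke the Ghostscript command-line executable through `ProcessBuilder`. The class and method names below are hypothetical, and the sketch assumes that a `gs` executable is on the PATH and that the `pnggray` output device suits your documents; neither assumption comes from the text above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds and runs a Ghostscript PDF-to-PNG command. */
final class GhostscriptCommand {
    private GhostscriptCommand() {}

    /**
     * Builds the argument list for rendering a PDF to one PNG per page.
     * The output pattern should contain %03d (or similar) for the page number.
     */
    static List<String> pdfToPng(String pdfPath, String outputPattern,
                                 int dpi, int renderingThreads) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        List<String> args = new ArrayList<>();
        args.add("gs");                    // assumption: executable name on the PATH
        args.add("-dNOPAUSE");             // do not pause between pages
        args.add("-dBATCH");               // exit after the last file
        args.add("-dSAFER");               // restrict file system access
        args.add("-sDEVICE=pnggray");      // assumption: grayscale PNG output
        args.add("-r" + dpi);              // output resolution in DPI
        if (renderingThreads > 1) {
            args.add("-dNumRenderingThreads=" + renderingThreads);
        }
        args.add("-sOutputFile=" + outputPattern);
        args.add(pdfPath);                 // input file goes last
        return args;
    }

    /** Runs the command and returns the process exit code. */
    static int run(List<String> args) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(args).inheritIO().start();
        return process.waitFor();
    }
}
```

For example, `GhostscriptCommand.pdfToPng("scan.pdf", "page-%03d.png", 300, 1)` produces the arguments for a 300 DPI conversion. Keeping command construction separate from execution makes it easy to log or test the exact settings being tried.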
Ghostscript's documentation suggests enabling multithreaded rendering via -dNumRenderingThreads=n (which increases the number of rendering bands so they can be processed concurrently on multi-core systems) or giving Ghostscript more memory to improve performance. In our experiments, however, these options offered little to no improvement for our set of input data.
The resolution at which you convert the document does have a direct impact on Ghostscript performance, albeit at the cost of output image quality. Converting documents at a lower DPI reduces the conversion time but increases the inaccuracy of the OCR interpretation, and vice versa. You can specify the output image resolution with the -rres argument (for example, -r300). By default, Ghostscript converts images at 72 DPI, which is quite low. Following is a comparison of performance results at different DPIs:
[Table: conversion performance by conversion resolution (DPI), starting at the 72 DPI default]
Selective Page Conversion
Another useful option is selective page conversion, for use cases where you only want to perform OCR on selected pages of a document. Converting only those pages instead of the entire document significantly reduces runtime, especially for larger documents. You can specify the range of pages you want to convert using the following two options: -dFirstPage=1 -dLastPage=n. Even if the document size is unknown prior to conversion, you can use any PDF library (such as Apache PDFBox) to retrieve the page count.
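As a rough illustration, the two page options can be derived from a requested range and a known page count. The `PageRange` class below is a hypothetical helper, and the page count is assumed to have been obtained elsewhere (for example, from a PDF library).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns a page range into Ghostscript flags. */
final class PageRange {
    private PageRange() {}

    /**
     * Returns the flags that restrict rendering to pages first..last
     * (1-based, inclusive), clamped to the document's page count.
     */
    static List<String> flags(int first, int last, int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("document has no pages");
        }
        int lo = Math.max(1, first);
        int hi = Math.min(last, pageCount);
        if (lo > hi) {
            throw new IllegalArgumentException("empty page range");
        }
        List<String> flags = new ArrayList<>();
        flags.add("-dFirstPage=" + lo);
        flags.add("-dLastPage=" + hi);
        return flags;
    }
}
```

These flags need to appear before the input file in the Ghostscript argument list so that they apply to it. Clamping the range guards against requesting pages beyond the end of a short document.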
The cost of converting pages one at a time remains roughly linear relative to converting the entire document, since there isn't any noticeable overhead associated with Ghostscript initialization. Significant performance improvements from selective page conversion start to kick in for documents over 20 pages. The following should provide a good relative comparison of conversion times for different document sizes:
[Table: runtime by pages in PDF (starting at "< 5 Pages"), comparing "converting 3 pages individually" with "converting entire document"]
Tesseract Performance Enhancements
The next bottleneck is the core Tesseract OCR process, which can also be tuned for performance. One optimization available through the Tess4J OCR wrapper method (doOCR) is calling it with a Rectangle. The Rectangle bounds the region of the image that is recognized during OCR. In test runs, the runtime improvement is about 4x when using a Rectangle of dimensions (0, 0, 1000, 1000) compared with not using a Rectangle.
Following are the runtime improvements when using Rectangles of different sizes in sample runs:
[Table: runtime improvement for Rectangles (0, 0, 1000, 1000), (0, 0, 1500, 1500) and (0, 0, 2000, 2000)]
Although not linear, there are still incremental improvements as the Rectangles get smaller. If your use case dictates performing OCR on the entire document, this will not be a good optimization candidate. Otherwise, the improvements can make a significant impact on your application's runtime.
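When a fixed Rectangle such as (0, 0, 1000, 1000) is applied to pages rendered at different resolutions, it may extend past the edge of a smaller image. The sketch below clips the requested region to the image bounds before it is handed to the OCR call. `OcrRegion` is a hypothetical helper; whether Tess4J itself tolerates out-of-bounds regions is not stated in the text, so the clipping is a defensive assumption.

```java
import java.awt.Rectangle;

/** Hypothetical helper for preparing the region passed to an OCR call. */
final class OcrRegion {
    private OcrRegion() {}

    /**
     * Clips the requested region so that it lies entirely inside an image
     * of the given size, and rejects regions that miss the image altogether.
     */
    static Rectangle clampToImage(Rectangle requested, int imageWidth, int imageHeight) {
        Rectangle bounds = new Rectangle(0, 0, imageWidth, imageHeight);
        Rectangle clipped = requested.intersection(bounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("region lies outside the image");
        }
        return clipped;
    }
}
```

The clipped Rectangle would then be passed to the Rectangle-accepting form of doOCR described above.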
In terms of resource usage, the entire OCR process is very CPU and memory intensive. Although the optimizations discussed above are advantageous, their benefit is eventually capped by this intensive process. To perform OCR validations in bulk efficiently, you need to parallelize the process on multi-core systems. The caveat is that the Tess4J OCR APIs do not support multithreading. The limiting factor is the way Tess4J uses Ghostscript: it relies on Ghostscript's low-level APIs, which do not support multithreading.
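One way to work within that constraint, offered here as an assumption rather than something the text prescribes, is to parallelize across separate worker processes instead of threads, giving each process its own batch of documents. The sketch below shows only the batching step; `BatchPartitioner` is a hypothetical name, and launching the worker processes is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits work into one batch per worker process. */
final class BatchPartitioner {
    private BatchPartitioner() {}

    /** Distributes items across the given number of workers in round-robin order. */
    static <T> List<List<T>> roundRobin(List<T> items, int workers) {
        if (workers < 1) {
            throw new IllegalArgumentException("need at least one worker");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            batches.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            batches.get(i % workers).add(items.get(i));
        }
        return batches;
    }
}
```

Round-robin assignment keeps batch sizes within one item of each other, which is a reasonable starting point when document sizes are similar.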
Accuracy of Tesseract OCR Process
In terms of accuracy, Tesseract's OCR is not completely precise and exhibits some level of variance when interpreting text images into ASCII. Common variances include:
- Misinterpretation of the letter case: Interpreting uppercase for lowercase letters and vice versa
- Mistaking letters, numbers, symbols that share similar ASCII symbol shapes, such as:
[Table: example confusions, with surviving column header "OCR Interpreted Character" and surviving entry "5, 6 or 8"]
As stated previously, higher quality images also help Tesseract analyze your image accurately. Following are the failure rates after performing OCR on the same set of images converted at different resolutions via Ghostscript. Diminishing returns surface after 250-300 DPI, while images converted below 200 DPI are of poor quality and prove to be ineffective OCR candidates:
[Table: OCR failure rate by conversion resolution (DPI), starting at "< 100 DPI"]
The Rectangles used to perform OCR can also affect the overall accuracy of the result. In our experiments, larger Rectangles tend to produce more accurate results than smaller ones.
For our purposes, we were only interested in optically recognizing numerical characters, so the noise due to the variance was easily overcome by building a map of common misinterpretations and replacing any misinterpreted characters. These variances occurred in at least 20% of the documents we tested. However, if you are interested in performing OCR on alphanumeric characters, you should explore improving accuracy by training Tesseract to better recognize your documents' fonts and languages.
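The misinterpretation map described above can be sketched as a simple character lookup. The specific pairs below are illustrative assumptions, not the map used at Wealthfront, and `DigitNormalizer` is a hypothetical name; the real entries should come from the errors observed on your own documents.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that repairs digits misread as similar-looking characters. */
final class DigitNormalizer {
    private static final Map<Character, Character> CONFUSIONS = new HashMap<>();
    static {
        // Illustrative pairs only: characters that resemble a digit's shape.
        CONFUSIONS.put('O', '0');
        CONFUSIONS.put('o', '0');
        CONFUSIONS.put('l', '1');
        CONFUSIONS.put('I', '1');
        CONFUSIONS.put('S', '5');
        CONFUSIONS.put('B', '8');
    }

    private DigitNormalizer() {}

    /** Replaces each known misread character with the digit it most likely stands for. */
    static String normalize(String ocrText) {
        StringBuilder out = new StringBuilder(ocrText.length());
        for (int i = 0; i < ocrText.length(); i++) {
            char c = ocrText.charAt(i);
            char fixed = CONFUSIONS.getOrDefault(c, c);
            out.append(fixed);
        }
        return out.toString();
    }
}
```

This kind of blanket replacement is only safe when the field being read is known to be purely numeric; applied to mixed text it would corrupt legitimate letters.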
Finding Optimal Settings
Despite the variances, inaccuracy, and performance overhead, Tesseract combined with Ghostscript still offers reasonable capability to perform optical character recognition in a cost-effective way. Ghostscript has a variety of options that can be explored to generate the best-suited document for your OCR process, and Tesseract can be tuned and trained to recognize your input documents with higher precision and accuracy. The core tradeoff is between performance and accuracy, since the two share an inverse relationship. The ideal options can only be discovered by experimenting with your own set of data!
Although not very common as a testing technique, optical character recognition was adopted for this unique scenario because of the strong premium we place on testing at Wealthfront. Software testing is emphasized here because we are a fully automated, software-based financial advisor, and testing is the key method we use to validate and gain trust in our overall processes. Needless to say, for a financial advisor that manages over $2.5 billion of client assets, trust is one of the fundamental and uncompromisable values of our core business.