Thursday, July 2, 2015

Testing with Optical Character Recognition (OCR)

Put not your trust in money, but put your money in trust.
-Oliver Wendell Holmes, Sr., The Autocrat of the Breakfast Table

At Wealthfront, we manage over $2.5 billion in assets that our clients have trusted us with. Trust and transparency are the foundations of a strong relationship between a financial institution and its clients; a lack of trust in an institution impedes its efficiency and innovation. At Wealthfront, we place a strong premium on establishing and maintaining trust with our clients.


One of the core ways in which we establish trust is by building robust and reliable software products for our clients. The essential component in ensuring high-quality software is testing. Without testing, we cannot validate or have confidence in any software we build. This is how we, at Wealthfront, gain trust in our code.



Logic Validation vs. Data Validation


Software can be tested and validated at various levels using a variety of techniques. Two necessary forms of software testing are (1) logic validation and (2) data validation. Logic validation is the process of determining whether the applied logic (in the form of code) exhibits the desired behavior, whereas data validation is the process of ensuring that the input data supplied to a system is valid and consistent with the system's intended data types. Many tools exist that facilitate and standardize logic validation practices in the industry; techniques for data validation, however, are particular to each use case.


At Wealthfront, we practice data validation in our continuous deployment model to ensure that each deployed service functions correctly with new data. As described in the linked blog post, the technique employed is customized for the specific use case in question. Another scenario where we use data validation is verifying the contents of document images we receive from external partners. Because these documents are images embedded in PDFs, we cannot use standard PDF parsing techniques to validate their content. Instead, we use a technique called optical character recognition (OCR).


Optical Character Recognition (OCR)

Optical character recognition (OCR) refers to the automated process of translating images of text into machine-encoded text, such as ASCII. It is widely used in commercial applications to store, edit, search, and analyze text documents (typewritten or printed) in a matter of seconds, which would otherwise be a cumbersome manual task. OCR works by scanning your images, extracting the contained text, splitting the text into characters, and then recognizing those characters. It can be trained to recognize a variety of fonts, languages, and even handwritten text. In the open source world, Tesseract is perhaps the most accurate and leading OCR engine. Originally developed as a PhD research project at Hewlett-Packard (HP) in the 1980s, Tesseract has been significantly enhanced by Google since it became open source. At Wealthfront, we use Tesseract to do OCR validation on scanned PDF documents.

Since Tesseract uses the Leptonica image processing library to perform OCR, it only works with image files such as PNGs or TIFFs and cannot process PDFs directly. It needs to be combined with a PDF interpreter such as Ghostscript, an excellent interpreter of PostScript and PDF that can render them to image files. To perform OCR from Java code, you need a Java Native Access (JNA) wrapper for simplified native access to the Tesseract OCR engine. Tess4J is a JNA wrapper that combines the Tesseract DLLs with Ghostscript to support PDF documents.


Following is some sample Java code that takes a scanned PDF document, converts it into PNGs, and then performs OCR using Tess4J libraries:
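A minimal sketch of what that code can look like (assuming Tess4J 3.x, where PdfUtilities lives in net.sourceforge.tess4j.util; the file path and tessdata location are illustrative):

    import java.io.File;

    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.util.PdfUtilities;

    public class ScannedPdfOcr {
      public static String extractText(File scannedPdf) throws Exception {
        // Convert each page of the scanned PDF to a PNG image;
        // PdfUtilities delegates to Ghostscript under the hood.
        File[] pngPages = PdfUtilities.convertPdf2Png(scannedPdf);

        Tesseract tesseract = new Tesseract();
        // Directory containing Tesseract's trained language data files.
        tesseract.setDatapath("/usr/local/share/tessdata");

        StringBuilder text = new StringBuilder();
        for (File page : pngPages) {
          // Run OCR on each page image and accumulate the results.
          text.append(tesseract.doOCR(page));
        }
        return text.toString();
      }
    }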



Sample OCR input (PNG file): [image of a scanned page containing the text below]

Sample OCR output (text):

    This is a lot of 12 point text to test the
    ocr code and see if it works on all types
    of file format.
    The quick brown dog jumped over the
    lazy fox. The quick brown dog jumped
    over the lazy fox. The quick brown dog
    jumped over the lazy fox. The quick
    brown dog jumped over the lazy fox.
Although Tesseract's accuracy in interpreting images to text is sufficient and compares well to commercial options, its execution speed is slow. From sample runs, it takes roughly 8-10 seconds to perform OCR on a small PDF document (3-4 pages). The immediate culprit here isn't Tesseract, though; it's Ghostscript. Tess4J's PdfUtilities internally uses Ghostscript to convert a PDF file to a set of PNG images.


Ghostscript Performance Enhancements

There are settings that can be tuned to increase the performance of Ghostscript. If you use the default convertPdf2Png method in Tess4J's PdfUtilities, custom settings cannot be applied. However, you can always write your own wrapper for Ghostscript and calibrate its settings to optimize the performance of your program, as in the sample below.
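As a rough sketch, such a wrapper might shell out to the gs binary directly (this assumes gs is on the PATH; the device and flag values shown are just starting points to tune):

    import java.io.IOException;

    public class GhostscriptWrapper {
      /** Converts a page range of a PDF into per-page PNG images. */
      public static void convertPdfToPng(String pdfPath, String outputPattern,
          int dpi, int firstPage, int lastPage)
          throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
            "gs",
            "-dNOPAUSE", "-dBATCH", "-dSAFER",
            "-sDEVICE=png16m",               // 24-bit color PNG output
            "-r" + dpi,                      // output resolution
            "-dFirstPage=" + firstPage,      // selective page conversion
            "-dLastPage=" + lastPage,
            "-sOutputFile=" + outputPattern, // e.g. "page-%03d.png"
            pdfPath)
            .inheritIO()
            .start();
        if (process.waitFor() != 0) {
          throw new IOException("Ghostscript conversion failed: " + pdfPath);
        }
      }
    }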

Ghostscript suggests enabling multithreaded rendering (increasing the number of rendering bands processed concurrently on multi-core systems) via -dNumRenderingThreads=n, or giving it more memory, for performance improvements. However, in our experiments these offered little to no improvement for our set of input data.


Output Resolution


The resolution at which you perform the document conversion has a direct impact on Ghostscript performance, albeit at the cost of output image quality. Converting documents at a lower DPI reduces conversion time but increases OCR misinterpretations, and vice versa. You can specify the output image resolution with the -r<res> argument (e.g. -r300). By default, Ghostscript converts images at 72 DPI, which is quite low. Below is a comparison of conversion runtimes at different DPIs:

    Conversion resolution (DPI):   72 (default)   100      200     250     300     400
    Runtime increase:              --             ~1.25X   ~2.2X   ~2.8X   ~3.5X   ~13X


Selective Page Conversion


Another useful option is selective page conversion, for use cases where you only need to perform OCR on certain pages of a document. It significantly reduces runtime by not converting the entire document, which especially matters for larger documents. You can specify the range of pages to convert with two options: -dFirstPage=1 -dLastPage=n. Even if the document's size is unknown prior to conversion, you can use any PDF reader (such as Apache PDFBox) to retrieve the page count, as in the sketch below.
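For example, with Apache PDFBox (a brief sketch; the helper class is hypothetical):

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;

    public class PdfPageCounter {
      /** Returns the number of pages in a PDF using Apache PDFBox. */
      public static int pageCount(File pdf) throws IOException {
        PDDocument document = PDDocument.load(pdf);
        try {
          return document.getNumberOfPages();
        } finally {
          document.close();
        }
      }
    }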

Single-page conversion time still scales roughly linearly relative to entire-document conversion, since there isn't any noticeable overhead associated with Ghostscript initialization. Significant performance improvements from selective page conversion start to kick in for documents over 20 pages. The following should provide a good relative comparison of document sizes and conversion times.


    Pages in PDF:                               < 5 pages   ~10 pages   ~50 pages   ~100 pages
    Runtime, converting 3 pages individually:   ~1 sec      ~1 sec      ~1 sec      ~1 sec
    Runtime, converting entire document:        ~1 sec      ~2 sec      ~11 sec     ~20 sec

*Conversion times may vary across different machines.


Tesseract Performance Enhancements

The next bottleneck is the core Tesseract OCR process, which can also be tuned for performance. One available optimization in Tess4J's wrapper method for OCR (doOCR) is calling it in combination with a Rectangle. The Rectangle bounds the region of the image to be recognized while performing OCR. From test runs, the runtime improvement is about 4x when using a Rectangle of dimensions (0, 0, 1000, 1000) in comparison to not using a Rectangle, as the sketch below shows.
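In Tess4J, that looks roughly like this (a sketch; the page image and region size are illustrative):

    import java.awt.Rectangle;
    import java.awt.image.BufferedImage;
    import java.io.File;

    import javax.imageio.ImageIO;

    import net.sourceforge.tess4j.Tesseract;

    public class BoundedOcr {
      public static String ocrTopLeftRegion(File pageImage) throws Exception {
        BufferedImage image = ImageIO.read(pageImage);
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/local/share/tessdata");
        // Recognize only the top-left 1000x1000 pixel region of the page.
        return tesseract.doOCR(image, new Rectangle(0, 0, 1000, 1000));
      }
    }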


Using Rectangle


Following are the runtime improvements from sample runs when using Rectangles of different sizes:


    Rectangle dimensions:   No Rectangle   (0, 0, 1000, 1000)   (0, 0, 1500, 1500)   (0, 0, 2000, 2000)
    Runtime improvement:    --             4X                   2X                   1.4X
    Runtime (seconds):      1.6 sec        0.4 sec              0.7 sec              1.1 sec
*Results were gathered after running doOCR on a PDF document image of US Letter size.

Although the gains are not linear, there are still incremental improvements when using Rectangles of decreasing sizes. If your use case dictates performing OCR on the entire document, this will not be a good optimization candidate. Otherwise, the improvements can make a significant impact on your application's runtime.

The entire OCR process is very CPU and memory intensive. Although the optimizations discussed above are advantageous, their benefits are eventually capped by this intensive OCR process. To perform OCR validations in bulk efficiently, you need to parallelize the process across multi-core systems. The caveat is that the Tess4J OCR APIs do not support multithreading: Tess4J uses Ghostscript's low-level APIs, which do not support multithreading.


Accuracy of Tesseract OCR Process

In terms of accuracy, Tesseract's OCR is not completely precise and exhibits some variance when interpreting text images into ASCII. Common variances include:
  • Misinterpreting letter case: reading uppercase letters as lowercase and vice versa
  • Mistaking letters, numbers, and symbols that have similar shapes, such as:

    Actual character   OCR-interpreted character
    0                  O
    0                  °
    I                  |
    I                  :
    5, 6 or 8          S

As stated previously, higher quality images help Tesseract analyze your image more accurately. Following are the failure rates after performing OCR on the same set of images converted at different resolutions via Ghostscript. Diminishing returns surface after 250-300 DPI, but images below 200 DPI have poor quality and prove to be ineffective OCR candidates:

    Conversion resolution (DPI):   < 100    150     200     250          300          400
    Failure rate:                  > 99%    ~51%    ~18%    negligible   negligible   negligible

The Rectangle used to perform OCR can also impact the overall accuracy of the result. In our experiments, Rectangles of larger sizes tended to produce more accurate results than smaller ones.

For our purposes, we were only interested in optically recognizing numerical characters, so the noise due to this variance was easily overcome by building a map of common misinterpretations and doing character replacements on any of them, as sketched below. These variances occurred in at least roughly 20% of the documents we tested. However, if you are interested in performing OCR on alphanumeric characters, you should explore improving accuracy by training Tesseract to better recognize your documents' fonts and languages.
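As a rough illustration, a digit-only cleanup pass might look like this (a hypothetical sketch, not our production map; the mappings follow the table above and would need tuning against your own documents):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class OcrDigitCleaner {
      // Characters Tesseract commonly emits in place of digits.
      private static final Map<Character, Character> REPLACEMENTS = new LinkedHashMap<>();
      static {
        REPLACEMENTS.put('O', '0'); // letter O misread for zero
        REPLACEMENTS.put('°', '0'); // degree sign misread for zero
        REPLACEMENTS.put('S', '5'); // 'S' can stand in for 5, 6, or 8; 5 is a guess
      }

      /** Replaces commonly misinterpreted characters in digit-only OCR output. */
      public static String clean(String ocrOutput) {
        StringBuilder cleaned = new StringBuilder(ocrOutput.length());
        for (char c : ocrOutput.toCharArray()) {
          cleaned.append(REPLACEMENTS.getOrDefault(c, c));
        }
        return cleaned.toString();
      }
    }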


Finding Optimal Settings


Despite the variances, inaccuracy, and performance overhead, Tesseract combined with Ghostscript still offers a reasonable, cost-effective way to perform optical character recognition. Ghostscript has a variety of options that can be explored to generate the documents best suited to your OCR process, and Tesseract can be tuned and trained to optically recognize your input documents with higher precision and accuracy. The core tradeoff is between performance and accuracy, which share an inverse relationship. The ideal settings can only be discovered by experimenting with your own set of data!


Conclusion


Although not very common as a testing technique, optical character recognition was adopted for this unique scenario because of the strong premium we place on testing at Wealthfront. Software testing is emphasized here because we are a fully automated, software-based financial advisor, and testing is the key method we use to validate and gain trust in our overall processes. Needless to say, for a financial advisor that manages over $2.5 billion of client assets, trust is one of the fundamental and uncompromisable values of our core business.

Tuesday, June 30, 2015

Performant CSS Animations: Netflix Case Study

While going over performant web animations with our new batch of interns, Netflix's new redesign was brought up as an example of something that seemed to have DOM nodes that changed size and pushed other DOM nodes around.

As covered in my previous blog post on Performant Web Animations, animating properties such as width or top forces the browser to recalculate the size and location of every component that may have changed, an operation called a layout. Layouts are something you want to avoid when trying to build performant web animations. The two properties that can be animated well are opacity and transform.

Netflix's main hover animation is a great example for explaining how complex animations can be built using only opacity and transform.

Netflix built their animations by avoiding layouts and being clever with animating transform and opacity. In January, Jordanna Kwok from Netflix said:

To build our most visually-rich cinematic Netflix experience to date for the website and iOS platforms, efficient UI rendering is critical. While there are fewer hardware constraints on desktops (compared to TVs and set-top boxes), expensive operations can still compromise UI responsiveness. In particular, DOM manipulations that result in reflows and repaints are especially detrimental to user experience.

So how did they make this effect, using only transforms and opacity? There are three things happening here on hover. The hovered tile enlarges, the content changes to show a description, and the other tiles in the row move to the sides. Let's break this effect down one piece at a time in order to understand it.

Enlarging The Tile

We have learned that the way to make an element larger is with transform: scale(), and we can simply apply that transformation on hover. However, if we have an image that is 100px by 100px and we ask the browser to grow it to 200px by 200px, we will end up with a blurry image. We can solve this by using an image that is twice as large and setting it to 50% of its width at scale(1).

Changing The Content

When you hover over one of the thumbnails on Netflix, it changes the image and text appears. This is done with an overlay the same size as the parent tile that contains the new image and the description. It starts with opacity: 0 and transitions to opacity: 1 on hover.

Sliding The Rest Of The Row

This is one of the most interesting parts of their effect. Instead of letting the browser recalculate the position of every element on the page, or even just in the row that is being modified, Netflix is using JavaScript to manually move the appropriate elements in the appropriate directions using transform: translate3d().

Putting It All Together

We now have all the pieces we need to make the effect: swapping out the content, making the tile grow, and moving its neighbors out of the way. By using an overlay and animating opacity, we can swap out the content. By animating transform: scale, we can make an element grow. By animating transform: translate, we can move the neighbors. Once we put these effects together, we end up with a demo just like Netflix's effect.

Conclusion

Beautiful website UIs are using more and more subtle animations to delight the user. In order to build animations that truly feel spectacular, we need to understand how to keep the browser from doing computationally expensive operations. Layouts and repaints are particularly expensive. With this redesign, Netflix has given us a great example of how a little creativity can produce great experiences even when only animating transform and opacity.

Tuesday, June 16, 2015

An Introduction to CommonJS

The CommonJS specification includes a module loading syntax to cleanly specify JavaScript dependencies. While first widely used in Node, it is now used heavily in the browser as well. Historically, when specifying dependencies for the browser, we had to manually manage a list of our files and keep them in the right order, and for the modules to communicate, everything had to be in global scope.

The Basics of CommonJS

With CommonJS modules, every file explicitly states its dependencies, letting the tooling figure out the ordering. Using CommonJS modules also keeps us from polluting the global namespace, which enables engineers to write and distribute high-quality libraries with shared dependencies.

Every CommonJS module is given two main globals: module.exports and require.

Require Caches Modules

From Good Eggs: "An important behavior of require is that it caches the value of module.exports and returns the same value for all future calls to require. It caches based on the absolute file path of the required file."

You can see that require is returning the same reference for each call.

Different Exporting Styles

There are a few different ways that you can and should export from your module. They are good for different times and different uses. Exporting objects and classes are the most common patterns we use.

  • Objects
  • Classes
  • Singletons
  • Monkey Patch

Objects

You can also use this style to return a collection of functions that don't share any state:

Classes

If you want to have multiple instances with shared state, then you should probably be exporting a class.

Singletons

Since require caches the value assigned to module.exports, all calls to require the module are given the same reference. This makes creating a singleton extremely similar to creating a class. We simply return a new instance of our class.

Monkey Patch

Warning! This pattern is dangerous and should rarely ever be used. Monkey patches are modules that modify global state. You may also hear them called shims.

This is typically a very dangerous pattern to follow because it breaks the dependency graph of the application. Depending on where the shim is required, we could be changing the behavior of our application.

This is typically only a reasonable thing to do if you are importing polyfills at the very top of your application, where it has pretty much the same behavior as loading the polyfill before any of your other code.

Public / Private State

One of the great parts about CommonJS is that everything is private except for what is set on module.exports. That means we can separate our public API and private helpers quite easily:

Simply don't export the things you don't want to be public, and they won't be accessible!

Conclusion

Specifying dependencies using CommonJS enables us to work on individual units of code and better reason about the environment they are being run in. By explicitly stating the dependencies for each file we don't need to maintain an ordered list of our files and can instead let our tools figure that out for us.

Writing our JavaScript in this way lets us be more explicit about what API modules provide while hiding their implementation. Our engineering team at Wealthfront has found the developer productivity gains and ease of testing invaluable with CommonJS and thinks you will too.

Read More

The CommonJS module syntax is the key behind reusable JavaScript code and is utilized by a large number of tools. Node uses CommonJS by default, and NPM is a package manager for people to distribute and share packages of CommonJS modules.

Browserify and Webpack are both bundlers which can be given a CommonJS module. They will walk the dependency tree of requires and bundle all of the files' dependencies into a single file (generally) that is then meant to be used on websites. Ben Clinkinbeard has written a great post explaining how Browserify works with CommonJS.

Tuesday, May 19, 2015

Performant CSS Animations

Web performance can be split into two relatively distinct categories: first page load and subsequent interactions. We can improve the first load by decreasing server response time and optimizing the loading of CSS and JavaScript. Once the website is loaded, there is a completely new set of performance challenges: using a JavaScript framework with good inherent speed, responding immediately to input events via the UI, and building animations that feel snappy.

There are many posts online about all of these topics; however, the challenge is knowing what to look for. The goal of this post is to provide a basic level of understanding focusing on animation performance, explain why all animations are not created equal, and provide some other performance related resources.

A Simple jQuery Example

This is the canonical jQuery animation example. We take an element with a class selector and move it down 100px over 2 seconds. While jQuery makes it really easy to do this kind of animation, it actually has bad performance implications. Let's dig into what is actually happening. The above code gets turned into something similar to this:

There are two major issues with this approach.

  • vSync
  • Animating Layout Properties

vSync

Notice the setTimeout in the example above. On a 60Hz monitor, we have 60 frames per second in which we can update our animation. Our goal is to have a setTimeout callback fire on each frame, but there are many cases where this doesn't actually happen: setTimeout and setInterval aren't precise and won't fire exactly when we want them to, once per frame. Instead, you may end up with two updates on a single frame, or a frame with no updates at all.

Thankfully, modern browsers provide a function, window.requestAnimationFrame, that takes a callback which the browser guarantees will be run on the next frame. You no longer need to try to time a setTimeout yourself.

More Reading: http://www.html5rocks.com/en/tutorials/speed/rendering/

Animating Layout Properties

There is another reason why we might drop a frame. When an animation is happening, the browser executes your script on every frame, figures out what it needs to paint on the screen, and then paints it. Since this happens about 60 times a second, the browser has a little over 16ms to complete all of that work. If a frame ever takes longer than the allotted time, the next frame is dropped so the browser can finish executing the previous one.


From Google IO 2013

It becomes important to know what the browser is doing on every frame in order to optimize our animations and get rid of dropped frames. On every frame the browser has a set of steps it takes, which I'll walk through below.

The Browser Frame Waterfall


From http://www.html5rocks.com/en/tutorials/speed/high-performance-animations/

Recalculate Style

The browser first runs your code and checks if any of the classes, styles, or attributes on elements have changed. If they have, it recalculates the styles for elements that might have changed, including inherited values.

Layout

The browser then calculates the size and position of every element on the screen, abiding by the rules of floats, absolute positioning, flexbox, etc.

Paint

The browser then draws all of the elements with their styles onto layers. Think of browser layers like Photoshop layers: collections of elements.

Composite

Once the browser has all of the elements drawn onto layers, it composites the layers onto the screen. This is just like rasterizing layers in Photoshop; it moves the layers around and draws them all into one image.


On every frame, the browser decides what has to be done. If a CSS rule changes the size of an element such as width, margin, or top, the browser will do a layout, paint, and composite.

If a CSS rule changes such as border-color, background-position, or z-index, the browser doesn't need to recalculate the layout, but it needs to repaint and composite the layers together.

There are a few rules that don't require a layout or a paint, which the browser can handle with just a composite. transform and opacity are the most common ones. Just like in Photoshop, moving layers around is very fast: Photoshop doesn't need to think about what is on the layer; it just moves the entire layer. The browser can do the same thing. Even better, while calculating element sizes for a layout happens on the CPU, things that only require a composite can happen purely on the GPU, which is much faster.

http://csstriggers.com is a great resource to learn more about which CSS properties cause a layout, paint, and composite.

More Reading: http://www.html5rocks.com/en/tutorials/speed/high-performance-animations/#toc-dom-to-pixels

What Can We Animate

The key to fast and smooth animations is to animate only things that require layer compositing, which is done on the GPU. opacity and transform (translate, scale, and rotate) are our toolkit to work with.

Hover over the boxes to see the effect

See the Pen mywqNB by Eli White (@TheSavior) on CodePen.

See the Pen KwqyOJ by Eli White (@TheSavior) on CodePen.

See the Pen gbRoQQ by Eli White (@TheSavior) on CodePen.

See the Pen rawpoW by Eli White (@TheSavior) on CodePen.

Putting these examples together to do more complicated animations is an exercise in thinking outside the box. Here are some examples.

Animate background-color?

See the Pen YPQEbZ by Eli White (@TheSavior) on CodePen.

The effect is simple. We have two identical overlapping boxes in the same place, one red and one teal. The teal starts at opacity: 0 and on hover of the containing box we transition the teal to opacity: 1.

Alert message

See the Pen zxzPVY by Eli White (@TheSavior) on CodePen.

In order to have a simple message slide in on a div, we position: absolute the alert box and translate it -50px up. Since the parent div has overflow: hidden, we don't see it. On hover of the parent div, we slide the alert box back down to 0.

Flipping Div

See the Pen zxzRvd by Eli White (@TheSavior) on CodePen.

While a little less useful, this is a good example of thinking outside (or behind) the box. Most people would assume animations like this are inefficient, but knowing what the browser is doing, and clever ways to put things together, can actually open doors of possibilities. Check out the codepen for the CSS.

Revisiting the setTimeout Example

We now know the above code is bad!

The setTimeout example that we dissected at the start of this post can easily be modified to take advantage of what we now know.

As we pointed out at the beginning, the big issue is setting top, a CSS property that causes a layout to happen, and thus also a paint and a composite. We will change that to a translate.

The other issue was that using setTimeout doesn't guarantee our animation will run on every frame. We will change that to requestAnimationFrame, which also passes our callback the time elapsed since the app started as a parameter. By changing our math a little to base step and current on time instead of distance, our animation will look smooth even if we still miss a frame.

We end up with our animation performance optimized code.

Conclusion

There are three key takeaways to building performant web animations.

  • Know how the browser executes your animations
  • Prefer to animate things that only require compositing while avoiding causing a layout
  • Use requestAnimationFrame over setTimeout or setInterval

In my experience, the hardest part of building a site that has well written animations is actually in the knowledge and communication between designers and engineers. Designers need to understand what palette and toolkit we have to work with to be able to come up with interactions and metaphors that can feel smooth to the user. Engineers need to know how to implement specific ideas and debug what the browser is doing when running their animations.

At the end of the day, all that matters is the experience the client has with the application. If we build things that are slow and jittery, that satisfaction will go down. It is on all of us to continue improving, continue educating ourselves, and continue building awesome experiences.

More Reading

Thursday, April 2, 2015

Similarities of Disruptive Companies

At the end of our Wealthfront Onboarding program, we ask new team members what about Wealthfront most differs from their prior expectations. The most common answer is “Wealthfront is not a finance company.”

This answer is unsurprising in retrospect, as it reflects a common yet fundamental misunderstanding of disruptive companies: many people believe startups win by directly competing against their industry incumbents. For example, they think Uber wins by providing better taxis than Yellow Cab, Airbnb wins by providing better hotels than Hilton, and Wealthfront wins by providing better financial advisors than Charles Schwab. This belief is wrong.

A common corollary of misunderstanding disruptive companies is fearing that joining them will limit future job opportunities. Again, disruptive companies repeatedly disprove this corollary. For example, Uber and Airbnb engineers do not worry their only future job prospects are in transportation or hospitality companies.

Disruptive companies often hold the greatest promise for generalist software engineers to build their career, because Internet-based disruptive companies are most successful when they are driven by technology.

Ingredients of Disruptive Innovation


Nearly all successful Internet-based businesses use convenience, simplicity, and cost to disrupt and reinvent their markets—the three common ingredients of disruptive companies identified by Disruption Theory, popularized by Clay Christensen. They disrupt their markets through the unparalleled benefits of software and the Internet, rather than trying to beat incumbents at their own game.

Uber and Airbnb are two recent successful startups that exemplify this principle:
  • While they provide route selection and accommodation scheduling, nobody thinks Uber owns taxis or Airbnb owns rental properties
  • While they implement specialized mathematics and algorithms for pathfinding and scheduling, the vast majority of Uber and Airbnb engineers are generalists with no prior background in transportation or hospitality industries
Wealthfront is no more like a finance company than Uber is like Yellow Cab or Airbnb is like Hilton. This explains why we repeatedly talk about being a technology company, why nearly all of our technical challenges are unrelated to investing, and why we avoid hiring engineers from the finance industry.

Common Disruptive Technical Challenges


Disruptive companies face remarkably similar technical challenges, despite delivering value across completely different industries. These technical challenges are far different than those faced by their legacy competitors. Here are some examples of the most difficult of these challenges:
  • Automating complex human workflows into horizontally scalable software platforms
  • Building data-driven features derived from complex real-world models
  • Scaling analytics and machine learning as data volume grows exponentially
  • Applying devops to scale operations and infrastructure entirely via code
  • Scaling test automation and infrastructure to ensure high-quality code across the stack
  • Ensuring security and preventing fraud despite rapidly increasing public visibility
  • Acquiring clients online via experimentation, funnel optimization, and quantitative marketing
None of these problems are specific to the finance, transportation, or hospitality industries. On the contrary, they are specific to disruptive companies built for the Internet using modern technology stacks.

Why Generalist Teams Win


Disruptive companies prefer to hire talented generalist engineers, while deliberately not hiring specialists from their legacy competitors. This preference is motivated by four factors:
  • Industry veterans are burdened by obsolescence bias, assuming constraints outdated by mobile and the Internet
  • Legacy technology experience is not required, and instead often gets in the way
  • Specialized expertise rapidly becomes obsolete and thus has a short shelf life
  • Generalists understand how to leverage new platforms and open source technology
These preferences explain why Wealthfront hires top quality generalist engineers and avoids specialists from the finance industry.

Your Market Value & Disruptive Companies


The single most important determinant of your future market value as an engineer is the culture and caliber of engineering teams in which you work. For example, Google and Facebook engineers are in high demand by many companies because their great engineering cultures have attracted many highly talented engineers.

Great engineers want to build software that impacts the lives of many, not focus on a narrow niche. They are best able to do so at disruptive companies, independent of the industry in which they participate. The market value of everyone at a disruptive company grows as a critical mass of great engineers come together.

The next time you consider joining a company, ask yourself whether it is disruptive, has a great engineering culture, and has attracted high caliber engineers. If the answer to all three is yes, then joining that company is likely the fastest path to maximizing both your current and future market value.

Wednesday, February 11, 2015

Pattern matching in Java with the Visitor pattern

I once read an essay—I cannot find it now—that talked about how learning a more advanced programming language can improve your coding even in a language that lacks those features because those language features are really ways of thinking, and there's no reason you cannot think in terms of more powerful abstractions just because you need to express them in a more limited language. The phrase that stuck with me was that you were "writing into" that language, rather than writing in the language as it's meant to be used. At Wealthfront, while the majority of our backend code is Java, we use a variety of methods that originate in functional languages. We've written before about our Option. This article is about pattern matching in Java.

I'm going to take a digression into explaining what pattern matching is and why it's so fantastic. If you like, you can also skip ahead to the actual Java examples.

Inspiration from post-Java languages

Pattern matching is a feature common in modern functional languages that allows a structure similar to switch but on the type of the argument. For example, let's take a base class that might have two subclasses, and we want to write logic that handles the two subclasses differently. An example might be a payment record that varies in type according to the method of payment (e.g. Bitcoin vs. credit card). Or maybe an owner that varies depending on whether it represents a single user or a group. This is useful for any class hierarchy representing some sort of data that has a base set of fields and subclasses that may have other data.

The visitor pattern was around before Haskell and Scala wowed everyone with pattern matching, but seeing pattern matching makes it easier to see why it's useful.

Scala pattern matching

Scala supports a match operator that concisely expresses the idea of switching on the type of the object.

What does this do? First, we have an abstract class Foo with two subtypes, Bar and Baz, which take different parameters: Bar stores an Int, Baz a String. Then we have a handle method that uses match to extract the appropriate field from either of these. Each matched case can contain whatever logic you need, and the type of b in both cases is the specific subclass, not just Foo.

Swift enums

Swift offers this same functionality through enums, which behave more like types than values.

The syntax differs, but it works out to the same thing. The let keyword is a helpful reminder of what's going on here: a new variable is created holding the same value that was in f, but now it has the specific type.

Simpler Java solutions

instanceof

One simple solution that comes to mind is to use instanceof.
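A sketch of that approach, using a hypothetical Foo hierarchy shaped like the Scala example's (Bar holds an int, Baz a String):

    abstract class Foo {}

    class Bar extends Foo {
      final int value;
      Bar(int value) { this.value = value; }
    }

    class Baz extends Foo {
      final String name;
      Baz(String name) { this.name = name; }
    }

    class InstanceofHandler {
      static String handle(Foo f) {
        // Branch on the runtime type and cast to get at subtype fields.
        if (f instanceof Bar) {
          return "Bar: " + ((Bar) f).value;
        } else if (f instanceof Baz) {
          return "Baz: " + ((Baz) f).name;
        }
        throw new IllegalArgumentException("Unhandled subtype: " + f.getClass());
      }
    }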

There are a few problems with this:
  • Because Hibernate wraps the class in a proxy object and intercepts method calls to make sure the right code is called, objects loaded from Hibernate will never be instances of the derived type, so this will not work.
  • Correctness is not enforced by the compiler. It's perfectly valid Java to write if (f instanceof Bar) { Baz b = (Baz) f; }, but it will fail every time.
  • There is no way to ensure completeness. There's nothing I can do to the existing code to make sure that it gets updated when someone adds a new subtype Qux.

Moving logic into the class

Another solution is to embed this logic in the class, like OOP says we should.
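A sketch of that approach, reusing the same hypothetical hierarchy:

    abstract class Foo {
      // Each subclass supplies its own behavior.
      abstract String describe();
    }

    class Bar extends Foo {
      final int value;
      Bar(int value) { this.value = value; }

      @Override
      String describe() { return "Bar: " + value; }
    }

    class Baz extends Foo {
      final String name;
      Baz(String name) { this.name = name; }

      @Override
      String describe() { return "Baz: " + name; }
    }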

This works fine when it's one method, or a few, but what happens when it grows to dozens? It means that your data objects start to become the location of all your business logic. If that's your style, that's okay and you have a lot of company, but I find it difficult to think about things this way. For example, if I want to verify the identity of owners of individual, joint, and custodial accounts, I could put a "verify identity" method in the AccountOwner type, but I'd prefer to create a single IdentityVerifier class that encapsulates all the business logic about verifying identity. The visitor pattern fits a model where data objects are simple and business logic is primarily implemented in various processor or "noun-verber" classes.

Another issue with business logic in the data class is that it makes it harder to mock for testing. With a processor interface, it's easier to mock it and return whatever data you want. With business logic in the class, you need to either set up the class so that your data actually satisfies all those rules, or you need to override the methods to return what you want. It makes it harder than it should be to write a test saying something like "accounts whose owners can be identified may be opened immediately".

The visitor pattern in Java

Basic Visitor

The basic visitor pattern in Java consists of the following (a minimal sketch follows the list):

  • An abstract base class with an abstract method match or visit taking a parameterized Visitor.
  • A parameterized Visitor class with a case method for each subclass.
  • Subclasses of the base class that each call the appropriate method of the visitor.
  • Application code that creates an anonymous instance of the visitor implementing whatever behavior is desired for that case.
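
Put together, a sketch of the pattern (again on the hypothetical Foo hierarchy):

    abstract class Foo {
      abstract <T> T match(Visitor<T> visitor);

      interface Visitor<T> {
        T caseBar(Bar bar);
        T caseBaz(Baz baz);
      }
    }

    class Bar extends Foo {
      final int value;
      Bar(int value) { this.value = value; }

      @Override
      <T> T match(Visitor<T> visitor) { return visitor.caseBar(this); }
    }

    class Baz extends Foo {
      final String name;
      Baz(String name) { this.name = name; }

      @Override
      <T> T match(Visitor<T> visitor) { return visitor.caseBaz(this); }
    }

    class VisitorExample {
      // Application code: an anonymous visitor supplies per-subtype behavior,
      // and the compiler enforces that every case is handled.
      static String handle(Foo f) {
        return f.match(new Foo.Visitor<String>() {
          @Override
          public String caseBar(Bar bar) { return "Bar: " + bar.value; }

          @Override
          public String caseBaz(Baz baz) { return "Baz: " + baz.name; }
        });
      }
    }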

Default visitor

It's sometimes useful to have special logic for some of the subclasses and a default value for the others. This can make the code more readable because it removes boilerplate that isn't part of what the code is trying to accomplish. You can do this with an implementation of the interface that supplies a default value for anything not overridden, as in the sketch below.
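A sketch of such a default visitor, building on the Visitor interface above:

    abstract class DefaultVisitor<T> implements Foo.Visitor<T> {
      // Fallback behavior for any case not explicitly overridden.
      protected abstract T defaultCase(Foo foo);

      @Override
      public T caseBar(Bar bar) { return defaultCase(bar); }

      @Override
      public T caseBaz(Baz baz) { return defaultCase(baz); }
    }

    class DefaultVisitorExample {
      // Special-case Bar; everything else falls through to the default.
      static String describe(Foo f) {
        return f.match(new DefaultVisitor<String>() {
          @Override
          public String caseBar(Bar bar) { return "special Bar: " + bar.value; }

          @Override
          protected String defaultCase(Foo foo) { return "something else"; }
        });
      }
    }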

The downside of this pattern is that updating default visitors can be overlooked when a new case is added. One way to handle this in practice is, while adding the new case, to make the default visitor abstract without implementing the new case, review all the code that breaks, and once satisfied that the behavior is correct, add in the default implementation for the new case.

Void or Unit return values

We generally define our visitors as being parameterized by a return type, but sometimes no return value is needed. At Wealthfront we have a Unit type with a singleton Unit.unit value to represent a return value that isn't meaningful, but java.lang.Void is also used.

I've used Void in this example to avoid introducing a new type, but I feel compelled to link to a discussion of why this is not ideal from a functional perspective: void vs. unit.

Destructuring pattern matching

These patterns make up all that you will likely need, and probably 90% of our use of the visitor pattern, but there's one more item worth mentioning. My Scala example above doesn't actually show the full power of pattern matching, because I'm just matching on the type. With case classes, or with a custom unapply method, I can match not just on the types of the objects, but on details of their internal structure. For example, using types similar to those I used before, here's a version that treats anything above 10 as "many".

Since this is a language feature in Scala, it's flexible and easy to use. You can simulate the same behavior in Java, but you need to encode the allowed cases into the visitor itself, as in the sketch below.
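A sketch of how that encoding might look (hypothetical types; the threshold of 10 mirrors the Scala example):

    // The visitor's cases encode value patterns, not just subtypes.
    interface AmountVisitor<T> {
      T caseSmall(int value); // matches values of 10 or less
      T caseMany();           // matches values above 10
    }

    class Amount {
      final int value;
      Amount(int value) { this.value = value; }

      // The class itself decides which case a value falls into,
      // so callers never repeat the pattern logic.
      <T> T match(AmountVisitor<T> visitor) {
        return value > 10 ? visitor.caseMany() : visitor.caseSmall(value);
      }
    }

    class AmountExample {
      static String describe(Amount amount) {
        return amount.match(new AmountVisitor<String>() {
          @Override
          public String caseSmall(int value) { return Integer.toString(value); }

          @Override
          public String caseMany() { return "many"; }
        });
      }
    }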

In some sense, 16 vs. 58 lines is a big difference, but you could also argue that 42 lines of additional boilerplate is worth it to simulate this powerful functionality in Java. This destructuring pattern matching is most useful for value types: objects that just represent collections of data and don't have any other identity attached to them. For entity types, which represent something defined as itself regardless of the values it currently has, it's better to use the basic pattern matching.

Is this really pattern matching, and how useful is it?

Some might object that this isn't "really" pattern matching, and I would agree. Pattern matching is a language level feature that allows you to operate on subclasses in a type-safe way (among other things). The type-safe visitor pattern allows you to operate on subclasses in a type-safe way even without language support for pattern matching.

As to its utility, I can say that we use it extensively at Wealthfront, and once people become familiar with it, it's great. Pretty much every polymorphic entity has a visitor, which makes it safe for us to add new subtypes, since the compiler will let us find all the places where they need to be handled. Visitors on value types, especially destructuring visitors, are much less common. We use them in a few places for things like Result objects that represent a possible result or error.

Give it a try the next time you run into a ClassCastException in your code.