The First Day
On my first day as a backend engineer at Wealthfront I received an overview of our architecture from one of our engineers. During the overview I realized I would be working with several technologies and data-flow patterns that were completely unfamiliar to me. Fortunately for me, Wealthfront was forward-thinking in choosing my first onboarding task: it would introduce me to Avro files and the Extract, Transform, Load (ETL) procedures on which our data platform relies. My assignment was to add ETL support for Request Tracker (RT), our customer service ticketing system.
What are Avro files?
Avro is a data serialization system used primarily in Apache Hadoop. It provides a serialization format for persistent data and a wire format for communication between Hadoop nodes. Avro files are compact and compressible, and they can be read in parallel by multiple nodes, even when compressed.
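Avro schemas are defined in JSON. A minimal ticket schema might look like the following sketch; the record and field names here are illustrative, not Wealthfront's actual schema:

```json
{
  "type": "record",
  "name": "RtTicket",
  "namespace": "com.example.etl",
  "fields": [
    {"name": "ticket_id", "type": "long"},
    {"name": "created_at_millis", "type": "long"},
    {"name": "status", "type": ["null", "string"], "default": null}
  ]
}
```

Fields typed plainly (like the two `long` fields) are required, while the union with `null` makes `status` optional, a distinction that matters when a record is serialized with a field left unset.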
What are ETL procedures?
Extract: Fetch the data from an outside source such as a database.
Transform: Process the data to be in a format that fits operational needs.
Load: Upload the data into the new desired location.
For this task the ETL procedure I wrote worked as follows:
Extract: I fetched the required data, e.g. the creation time of a customer ticket, from the RT database.
Transform: I transformed the data to be properly formatted for storage inside my custom Avro files.
Load: I embedded the output files in our service to upload to our data platform.
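The extract and transform steps above can be sketched in plain Java. This is a hedged, hypothetical sketch: the class and field names are illustrative, and a simple `Map` stands in for both the database row and the Avro record.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RtTicketEtlSketch {
    // Extract: in production this would query the RT database;
    // here a field map stands in for a fetched row.
    static List<Map<String, Object>> extract() {
        return List.of(Map.<String, Object>of(
            "ticket_id", 42L,
            "created_at", Instant.parse("2013-06-01T12:00:00Z")));
    }

    // Transform: convert each raw row into the flat representation the
    // schema expects, e.g. timestamps as epoch milliseconds.
    static List<Map<String, Object>> transform(List<Map<String, Object>> rows) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            out.add(Map.of(
                "ticket_id", row.get("ticket_id"),
                "created_at_millis", ((Instant) row.get("created_at")).toEpochMilli()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = transform(extract());
        System.out.println(records.get(0).get("created_at_millis")); // prints 1370088000000
    }
}
```

The load step is omitted here because, as described above, it amounted to embedding the output files in our service for upload rather than any per-record logic.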
At Wealthfront, testing is a fundamental part of our workflow. I wrote a test for my RtTicketEtl transform method, which retrieves data from the RT database, parses it, and stores it in an Avro file. The importance of testing was made clear to me when the test failed with a runtime error.
The error showed that I had forgotten to set a value for a field I intended to store in the Avro file. Catching this at test time was crucial: I would otherwise not have detected the error until the ETL ran in production. I quickly added the code to set the missing value, and passing all of my tests gave me confidence that the Request Tracker ETL would be created correctly. Building the project, including running all local test cases, completed in under thirty seconds. That speed lets me keep my focus on the task at hand, in contrast to my previous experience with large projects that took over ten minutes to build.
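The failure mode described above can be sketched in plain Java. This is a stand-in for Avro's behavior of failing at serialization time when a required field was never set; the class and method names are illustrative, not Avro's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RequiredFieldChecker {
    // Fail fast if any required field was never set on the record,
    // analogous to a serializer rejecting a null non-nullable field.
    static void checkRequired(Map<String, Object> record, List<String> requiredFields) {
        for (String field : requiredFields) {
            if (record.get(field) == null) {
                throw new IllegalStateException("Field " + field + " was never set");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> ticket = new HashMap<>();
        ticket.put("ticket_id", 42L);
        // "created_at_millis" is never set, so the check fails --
        // mirroring the runtime error the test surfaced.
        try {
            checkRequired(ticket, List.of("ticket_id", "created_at_millis"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints: Field created_at_millis was never set
        }
    }
}
```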
Within a week of starting I was already computing metrics in parallel using Cascading. Cascading makes writing Apache Hadoop jobs more manageable by abstracting away many of the underlying map and reduce tasks. This task builds on my earlier work implementing the Request Tracker ETL support, letting me understand another phase of the data analysis process. The metrics I am computing will be used to analyze how we can accelerate and streamline our signup process for new clients.
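Running Cascading itself requires its library and a Hadoop cluster, but the underlying shape of such a metric job, grouping records by a key and aggregating each group, can be illustrated with plain Java streams. The event and step names below are hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SignupFunnelCount {
    record Event(String userId, String step) {}

    // Count events per signup step -- the same group-and-aggregate
    // pattern a Cascading GroupBy/Count pipe expresses over Hadoop tuples.
    static Map<String, Long> countByStep(List<Event> events) {
        return events.stream()
            .collect(Collectors.groupingBy(Event::step, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("u1", "email_entered"),
            new Event("u2", "email_entered"),
            new Event("u1", "questionnaire_done"));
        System.out.println(countByStep(events).get("email_entered")); // prints 2
    }
}
```

The difference in production is scale: Cascading distributes the grouping and counting across Hadoop nodes instead of a single in-memory stream.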
My early tasks have been satisfying: I was able to push to production quickly despite the concepts being completely unfamiliar to me before joining the team. I look forward to analyzing data, knowing the results are metrics we use to shape how we build and improve our products.