R is a powerful statistical programming language with a large range of built-in capabilities that are further augmented by an extensive set of third-party packages available through the Comprehensive R Archive Network (CRAN). At Wealthfront, R is an important tool with many use cases across the company. These include analyzing our business metrics to enable our data-driven company culture and conducting research for our investment methodology, including our recently announced Tax-Optimized Direct Indexing.
Limitations Of CRAN
To support the multiple use cases of R at Wealthfront, we rely on a large number of packages, both third-party packages available through CRAN and packages we have developed internally. The wide availability of existing packages and the ease of creating new ones enables powerful and sophisticated research and analysis, but the management of these packages can be difficult.
For third party packages, CRAN makes it easy to quickly obtain and install new packages, but parts of its design are problematic when used in business-critical settings. At Wealthfront, we faced the following challenges when using CRAN.
Archiving Of Packages And Lack Of Versioning Makes Reproducibility Difficult
On any day, new packages may be added to CRAN, old packages archived, and existing packages updated. Although individual packages themselves are versioned, only the latest package versions are available for download and installation using the built-in install.packages function. By default install.packages downloads the latest package and overwrites any existing older version of the package. The library function then loads the version that was last downloaded and installed.
The ephemeral state of CRAN poses multiple problems. First, the lack of explicit versioning when installing and loading packages makes reproducing analyses and research difficult. The exact results obtained may depend on the specific package versions used and these may vary from machine to machine. Changes to packages may also not be backward compatible, often breaking existing tools and scripts. Further, packages that have been archived on CRAN can no longer be installed using install.packages and must instead be directly downloaded and then installed from the source tarball.
Inconsistent Installation And Dependency Management For Proprietary Vs. Third-Party Packages
In addition to the third-party packages we use at Wealthfront, we also have a large number of internally developed packages we depend on. These packages are proprietary and so cannot be uploaded to CRAN. Having some packages from CRAN and others from private repositories results in inconsistent installation processes for both sets of packages. For third-party packages from CRAN, built-in R functions such as install.packages can be used, but for proprietary packages either built-in R tools such as R CMD INSTALL or methods such as install_git and install_local from the devtools package must be used. Further, R CMD INSTALL and devtools do not handle dependencies between proprietary packages, so the packages must be manually installed in an order that satisfies the dependency chain.
Rather than have packages in two separate locations and different installation and dependency management procedures, we want all packages to be accessible from a single location with consistent steps for installing packages and managing dependencies.
Inability To Review And Document Use Of Third-Party Packages
While there are a large number of packages available on CRAN, the quality of these packages can vary greatly. The quality of any research or analysis is dependent on the packages used and so we must ensure that the packages we rely on are well-written, maintained, and well tested. Thus, for any analysis or research that is production or business critical, we must ensure that we have vetted the packages being used.
WF-CRAN
We developed an internal CRAN with these limitations in mind. WF-CRAN holds the third-party packages we depend on along with our internally-developed packages. The following describes our design choices for implementing WF-CRAN and how these choices addressed the challenges we faced with CRAN.
Repository Versioning
There has been much community discussion and debate about the lack of CRAN package versioning (see this discussion about having a release and development distribution of CRAN and this R Journal article discussing options for introducing version management to CRAN). With WF-CRAN, we took a versioned repository approach. Each time a package is added or modified in WF-CRAN, a new version of the repository is created that includes the latest version of the packages contained in the repository. With this approach, we can continue to use the existing R functions for managing and loading packages, including install.packages and update.packages, by explicitly specifying the repository version using the repos argument. For example, the package xts from version 1.81 of WF-CRAN can be installed using:
install.packages(xts, repos = “http://wf-cran/1.81”)
Further, dependencies of packages, both third-party and proprietary, can be automatically installed by specifying the dependencies argument for install.packages.
CRAN Nexus And Maven Plugins
Rather than write much of the server software from scratch, we chose to implement WF-CRAN on top of Sonatype Nexus by writing a CRAN Nexus plugin. Together with an internally developed CRAN Maven plugin, new packages and package versions can easily be deployed by navigating to the directory containing the package and running the command mvn deploy. Maven and our Maven CRAN plugin are then responsible for running the R CMD build command to create the package source archive and uploading this to WF-CRAN. The Nexus plugin then creates the repository structure and PACKAGES file for the new versioned snapshot of the repository by parsing the package DESCRIPTION files and creating symlinks to the latest versions of each package. We currently only upload the package source to WF-CRAN and so all packages must be installed from source. This is made the default for install.packages by setting
options(pkgType = “source”)
in each user’s .Rprofile.
Source Control Of Packages
The source for the packages themselves are kept in Git repositories. We have multiple repositories containing R packages, with each repository containing packages that support a set of related use cases, such as research or business analytics. We also have a repository that holds the source code for all of the third-party packages we use. This allows us to document which packages we depend on and to explicitly approve the use of new third-party packages by code reviewing the addition of the packages to the third-party repository. Although packages are contained in multiple repositories, once the packages are deployed to WF-CRAN, they can all be installed and loaded using the built-in R functions such as install.packages and library.
With this setup, the workflow for adding or modifying a package in WF-CRAN is:
- Create a branch in the Git repo containing the package
- Have the change code reviewed
- Merge to master and run mvn deploy
WF-Checkpoint
One of the main motivations for WF-CRAN was enabling verifiably reproducible research and analysis. The versioned repository structure of WF-CRAN enables this by allowing us to explicitly specify the version of the repository used in a script. We do this using an approach similar to that of Revolution Analytics’ Reproducible R Toolkit and checkpoint package. Revolution Analytics’ approach involves taking a daily snapshot of CRAN and making these snapshots available through their checkpoint-server. In a script, the checkpoint function is used to specify which daily snapshot of CRAN to use.
library(checkpoint)
checkpoint(“2015-01-01”)
The checkpoint package then analyzes the script to find all of the package dependencies and installs these to a local directory using the versions from the specified snapshot date. The script is then run using these packages.
At Wealthfront, we developed a similar package, WFCheckpoint, that takes a version number rather than a snapshot date as an argument and uses the packages for the specified repository version to run the script. For example, to run a script using packages from version 1.81 of WF-CRAN, the following can be added to the top of the script:
library(WFCheckpoint)
checkpoint(“1.81”)
The WFCheckpoint package, together with WF-CRAN, thus allows us to easily reproduce the results of any research or analysis.
How This Has Helped Us
As a data-driven and research-intensive company, our culture and success is built on producing high quality and reproducible research and analysis. WF-CRAN and WFCheckpoint have allowed us to bring our rigorous engineering practices to R, enabling more scalable backtests, sophisticated and interactive dashboards, and more unified development environment for research.
At Wealthfront, we strive to constantly improve the investment products and services we provide to our clients and WF-CRAN and WFCheckpoint are just a few examples of the tools we have developed at Wealthfront that enable us to do so.