Messy Data IRL: Building Infrastructure For Tidy Data

Our pipelines follow the classic ETL (extract, transform, load) pattern:

  1. Extract: Relevant data is pulled from raw files and other upstream data sources
  2. Transform: The data is processed and reshaped into the correct, relevant form the company needs
  3. Load: The output is loaded into a database or data warehouse
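
As a minimal, hedged illustration, one pass of this pattern might look like the Python sketch below. The file name, table name, and transform rules are stand-ins, not our actual pipeline; pandas and SQLite stand in for whatever source and warehouse a given job touches.

```python
import sqlite3

import pandas as pd

# Extract: pull raw records from an upstream source. The CSV here is a
# stand-in for any source we actually see: an API, a database image, a
# parsed PDF, even a scanned paper record keyed in by hand.
raw = pd.read_csv("raw_decision_makers.csv")

# Transform: reshape the data into the form downstream work needs.
tidy = (
    raw.rename(columns=str.lower)
       .dropna(subset=["id"])
       .drop_duplicates(subset=["id"])
)

# Load: write the tidy output to the warehouse (SQLite as a stand-in).
with sqlite3.connect("warehouse.db") as conn:
    tidy.to_sql("decision_makers", conn, if_exists="replace", index=False)
```

Building this out at BlueLabs surfaced four recurring challenges: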
  • Scale: In the beginning, we only had a handful of data points. A year later we have over a thousand moving pieces and more than 40 million decision-makers.
  • Heterogeneity: Our data comes from everywhere. We reconcile records across public and private datasets, database images, APIs, PDFs, hard drives, and the occasional CD. At one point we even received a paper copy of records!
  • Reliability: Data is a first-class citizen in our ecosystem. We’ve set up a system for catching and fixing data reliability issues as they arise to ensure a premium data experience.
  • Metadata: Data governance and data provenance are big these days, and our data presents some unique challenges in this respect. Our most recent and exciting foray on the data-management side has been using metadata to better understand how data flows through our systems.
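
To make the provenance idea concrete, here is a hedged sketch of what tagging records with lineage columns could look like; the column names and helper function are our illustration, not an actual BlueLabs schema.

```python
from datetime import datetime, timezone

import pandas as pd

def with_provenance(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Attach lineage columns so any row can be traced back to its origin."""
    out = df.copy()
    out["_source"] = source            # e.g. "vendor_api", "county_pdf"
    out["_loaded_at"] = datetime.now(timezone.utc).isoformat()
    return out

# Hypothetical usage: tag a county roll parsed from a PDF with its origin.
records = with_provenance(pd.read_csv("county_roll.csv"), source="county_pdf")
```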

Scale: Always a challenge when transitioning from a minimum viable product to a full-fledged one

We won’t claim to have solved the problem of scale, but we have built a human-centric framework for engaging productively with scale-related problems. One of the biggest challenges we encountered was orchestrating pipeline runs, and the team members across job functions who keep them moving. Our answer was to build the pipeline out of small, templated SQL processes, which let us:

  • Easily understand each process. Processes are small, modular, and self-explanatory to everyone working in the pipeline. Each process has a templated SQL portion that any team member can read (see the sketch after this list).
  • Process data of any scale. Whether it’s 100 million rows or 10 thousand, we don’t need to set up special EC2 instances with more memory or write memory-efficient code, because SQL and our database take care of that.
  • Easily test results of the pipeline. Anyone can write arbitrarily complex business logic to test intermediate or final results. Although we log and track universal QC metrics, heterogeneous data demands custom rules as well (an example follows after this list).
  • Anyone with a data background understands SQL. SQL allows for universal access, quick exploration, and introspection of the data in a way that other methods don’t. Plus, almost everyone at BlueLabs knows SQL, so anyone outside the team can help at any point.
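
As promised above, here is a rough sketch of what one small, templated SQL process can look like. The template fields, table names, and tiny runner are our own invention for illustration, not BlueLabs’ actual framework; SQLite stands in for the production database.

```python
import sqlite3
from string import Template

# One small, self-explanatory process: a single templated SQL statement
# whose parameters are filled in per run. The SQL itself stays readable
# to anyone on the team, regardless of job function.
DEDUPE_STEP = Template("""
    DROP TABLE IF EXISTS $output;
    CREATE TABLE $output AS
    SELECT *
    FROM $input
    GROUP BY $key;
""")

def run_step(conn: sqlite3.Connection, **params: str) -> None:
    """Render the template and let the database do the heavy lifting."""
    conn.executescript(DEDUPE_STEP.substitute(**params))

with sqlite3.connect("warehouse.db") as conn:
    run_step(conn, input="raw_contacts", output="contacts", key="contact_id")
```

Because the work happens inside the database engine, the same step runs unchanged on 10 thousand rows or 100 million.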
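And a hedged example of the kind of custom QC rule anyone can bolt on; the table and column names here are invented for illustration:

```python
import sqlite3

def check(conn: sqlite3.Connection, name: str, sql: str) -> None:
    """Fail loudly if a QC query counts any offending rows."""
    bad = conn.execute(sql).fetchone()[0]
    assert bad == 0, f"QC rule {name!r} failed: {bad} offending rows"

with sqlite3.connect("warehouse.db") as conn:
    # Universal metric: no duplicate keys in the final table.
    check(conn, "unique_ids",
          "SELECT COUNT(*) - COUNT(DISTINCT contact_id) FROM contacts")
    # Custom business rule: every contact carries a state code.
    check(conn, "state_present",
          "SELECT COUNT(*) FROM contacts WHERE state IS NULL")
```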

Heterogeneity: We can make no assumptions about the shape of our data

When we searched for solutions, the heterogeneity of our data was one of the biggest impediments to adopting anything out of the box. In effect, we can make no assumptions about the shape of our data: we see it in flat formats, in graph formats, and in various levels of normal form.
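
To make that concrete with invented record shapes: a flat row and a graph-style payload can describe the same entity, and each needs its own path into a common schema. The records and normalizers below are purely illustrative.

```python
# Two invented inputs describing the same person, arriving in different shapes.

# Flat / denormalized: one row carries everything.
flat_record = {"id": 17, "name": "A. Smith", "employer": "Acme", "role": "CFO"}

# Graph-style: nodes and edges, closer to how relationship data arrives.
graph_record = {
    "nodes": [{"id": 17, "name": "A. Smith"}, {"id": 99, "name": "Acme"}],
    "edges": [{"from": 17, "to": 99, "type": "role", "value": "CFO"}],
}

def normalize_flat(rec: dict) -> dict:
    return {"id": rec["id"], "name": rec["name"],
            "employer": rec["employer"], "role": rec["role"]}

def normalize_graph(rec: dict) -> dict:
    person, org = rec["nodes"]
    edge = rec["edges"][0]
    return {"id": person["id"], "name": person["name"],
            "employer": org["name"], "role": edge["value"]}

# Both paths converge on the same tidy shape.
assert normalize_flat(flat_record) == normalize_graph(graph_record)
```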
