We often talk about data engineering like it’s plumbing – using words like pipeline and flow – and it’s a useful analogy in most contexts. Sometimes, though, especially in public service analytics, the analogy can create misleading intuitions about how to approach the problems that arise when managing large quantities of uncontrolled data.
There are real and important engineering challenges in getting large amounts of water to flow through pipes efficiently and reliably, but the focus is usually on the pipes rather than on the liquid flowing through them. Likewise, a data pipeline usually includes methods for detecting and rejecting the data equivalent of sand and gravel coming from the source; if there is a problem with the data, the solution is to address it at the source.
But what if a data engineer can’t just tell the source to clean up their data? What if the pipelines need to process data from sources that are too remote and too numerous or complex to corral into a reliable shape and quality? We call this incorrigible data, and it’s a fairly common and frustrating problem in the public sector.
Comparing private and public sector data
In the corporate world, businesses often have well-resourced and internally empowered teams who provide quality data to their analysts. Though different departments that are the sources of the data may control and define it, the analysts have direct access to both the people generating the data and those who manage the infrastructure that houses it. Even when data is external to the corporation, it’s often obtained through a well-defined API that provides some expectation of structural consistency and quality. As a result, breaches of these expectations by external partners are the exception rather than the rule due to the possibility of contractual or market penalties.
In the public sector, on the other hand, the organization or agency providing the data may have limited resources – human or otherwise – available to respond to modification requests about the supplied data. They often can’t afford to prioritize maintaining a certain shape and content of their data for the convenience of other programs outside their area of responsibility.
Further, a program may receive data from a substantial number of different sources, each supplying essentially the same data but in different structures or with different semantics within the same structure. For example, 50 states can have 50 slightly different and constantly shifting versions of any dataset. Building and maintaining that many bespoke pipelines is frustrating and exhausting, and the data engineer can’t ask each state to straighten up and fly right. Instead, data engineers have to come to terms with the reality that they need to live with a certain amount of incorrigible data.
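To make the problem concrete, here is a minimal sketch of what “the same data in different structures” can look like. The field names, date formats, and record shapes are entirely hypothetical, but the pattern – one small normalizing function per source, converging on shared field names – is typical:

```python
from datetime import datetime

# Hypothetical rows: two states reporting the same county-level counts,
# but with different field names, date formats, and types.
state_a_row = {"CountyFIPS": "01001", "rpt_date": "2024-03-01", "cases": "12"}
state_b_row = {"fips": "36001", "date": "03/01/2024", "case_count": 12}

def normalize_state_a(row):
    # State A labels the county code "CountyFIPS" and reports cases as text.
    return {
        "fips": row["CountyFIPS"],
        "report_date": datetime.strptime(row["rpt_date"], "%Y-%m-%d").date(),
        "cases": int(row["cases"]),
    }

def normalize_state_b(row):
    # State B uses US-style dates and a differently named count field.
    return {
        "fips": row["fips"],
        "report_date": datetime.strptime(row["date"], "%m/%d/%Y").date(),
        "cases": int(row["case_count"]),
    }
```

Multiply this by 50 sources – each of which may change format without notice – and the maintenance burden of source-by-source pipelines becomes clear.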
Managing incorrigible data
Fortunately, incorrigible data is not a death knell for scalable reliability; it simply requires a shift in focus. Instead of treating data like plumbing, where the pipes are the main interest, think of it more like civil engineering for water resources: the water is the focus, and the engineering effort is defined by the need to achieve certain outcomes despite limited control over quality, quantity, and terrain.
The frustrations many stakeholders experience when trying to make use of public sector data stem from applying the practices of municipal plumbing when they actually have lakes and rivers to manage. In other words, public sector teams can’t use corporate best practices for managing data because they can’t control the data the way corporate engineers can.
Engineering for incorrigible data is its own discipline, and intuitions brought from other domains – like the private sector – can be misleading and distracting. The key to managing incorrigible data is giving control of the data journey to data analysts who are focused on its semantic meaning.
There are many reasons to do this even when the data is not incorrigible, including the desire to save the expense of data engineers. But the reason it’s crucial when dealing with incorrigible data is that people who understand each data source’s specific needs and circumstances are best positioned to understand what analytic questions the data can answer. Those same people also understand what feasible steps they can take to adapt business logic to the challenges a source’s data may pose.
There’s almost always some kind of negotiation between what a source has the resources and priorities to supply and what is reasonable for the team processing the data to accommodate. Data analysts are best positioned to have those conversations and make those calls.
Enabling data analysts to work on the business logic means freeing them from working on the data processing infrastructure. The less an analyst has to know about details that have nothing to do with the data itself, the better. Any processing that happens in languages the analysts are not comfortable with should be kept to a minimum. When working with incorrigible data, any “black box” steps in which the analysts are not completely sure what is happening impose greater cognitive load and inject doubt into the reasoning about the data that they need to perform.
This doesn’t mean that data engineers and architects should just hand over the keys to a platform and expect data analysts to build efficient and reliable flows. Helping analysts isolate source-specific logic at the beginning of flows and apply standardized, DRY logic to converge on a common data model remains an important and challenging task. Flows must be modularized, and each step must be paired with data validations.
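The shape described above can be sketched in a few lines. This is an illustrative pattern, not a prescribed implementation: the field names and validation rules are hypothetical, but the structure – source-specific logic confined to the head of the flow, followed by shared, source-agnostic validation toward a common data model – is the point:

```python
from datetime import datetime

# Common data model every source must converge on (fields are illustrative).
COMMON_FIELDS = {"fips", "report_date", "cases"}

def validate(record):
    """Shared check, paired with each step; rejects records that break the model."""
    if set(record) != COMMON_FIELDS:
        raise ValueError(f"unexpected fields: {sorted(record)}")
    if not (isinstance(record["fips"], str) and len(record["fips"]) == 5):
        raise ValueError(f"bad FIPS code: {record['fips']!r}")
    if record["cases"] < 0:
        raise ValueError(f"negative count: {record['cases']}")
    return record

def normalize_state_a(row):
    # Source-specific step: only this function knows this source's layout,
    # so analysts can adapt it when the source changes without touching
    # the shared logic downstream.
    return {"fips": row["CountyFIPS"],
            "report_date": datetime.strptime(row["rpt_date"], "%Y-%m-%d").date(),
            "cases": int(row["cases"])}

def run_flow(rows, normalize):
    # Shared, DRY body of the flow: normalize each record, then validate it.
    return [validate(normalize(row)) for row in rows]
```

Because each normalizer is small and written in a language the analysts work in, there are no black-box steps: an analyst can read exactly what happens to a source’s data and adjust the business logic when that source inevitably shifts.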
With guidance and support from data engineers and architects, data analysts can take on many of the reliability and maintainability practices of data engineers while maintaining the contextual grasp on data semantics that allows them insight into incorrigible data.
Incorrigible data, corrected
It’s easy to get frustrated and perplexed when techniques and tools that work so well for banks and tech companies feel like such a Sisyphean slog in the public sector. Data engineering techniques and tooling optimized for the private sector don’t fit public sector organizations and agencies as well because the problem space isn’t the same. Incorrigible data will always present deep and diverse challenges, but agencies that shift focus away from replicating private sector data engineering solutions and toward empowering their analysts to manage that incorrigible data will find that the data can be tamed after all.