Building your data plumbing: Lessons from a backyard drainage project
Use system mapping, data lineage, and scalable tools to avoid downstream data breakdowns.

In our previous article, we established the basic principles of effective data infrastructure. Now, let’s dig deeper into how organizations can create frictionless data sharing across business units.
There I was, knee-deep in clay, installing a drainage system in my daughter’s backyard. As I connected downspouts to pipes leading to the back easement, I couldn’t help but see parallels to building effective data infrastructure.
Mapping your sources
Step one of our drainage project: Identify all water sources and map their natural flow. Similarly, in data infrastructure, we must first identify all data origins and understand their pathways.
Take a point-of-sale transaction. It includes customer details, product orders, and the transaction record itself. Accurately identifying these original sources and documenting how data travels from them—a process called data lineage—prevents countless headaches later on. By cataloging each data provider (whether internal systems or outside vendors) and capturing consistent metadata, you create the foundation for smooth data flow throughout your organization.
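To make that concrete, here's a minimal sketch of what a catalog entry for that POS transaction might look like. The `SourceRecord` structure and field names are illustrative assumptions, not a reference to any particular catalog tool:

```python
from dataclasses import dataclass

@dataclass
class SourceRecord:
    """One entry in a lightweight data-source catalog."""
    name: str              # logical name of the data source
    system_of_record: str  # the provider that originates this data
    owner: str             # team accountable for the data's quality
    fields: dict           # attribute name -> description
    refresh: str           # how often new data arrives

# Cataloging the point-of-sale transaction from the example above.
catalog = {
    "pos_transaction": SourceRecord(
        name="pos_transaction",
        system_of_record="StoreFront POS",
        owner="Retail Operations",
        fields={
            "customer_id": "Customer placing the order",
            "order_lines": "Products and quantities purchased",
            "transaction_ts": "When the sale completed",
        },
        refresh="real-time",
    ),
}

for record in catalog.values():
    print(f"{record.name}: owned by {record.owner}, sourced from {record.system_of_record}")
```

Even a simple registry like this gives every downstream team a single place to learn where a piece of data comes from and who answers for it.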
Avoiding underground hazards
Before digging, we carefully marked all underground utilities. One wrong move with the trencher could mean cut power, communications, or worse.
In your data ecosystem, this translates to identifying all systems that will either consume or contribute to your source data. That POS transaction we mentioned? It will travel through inventory systems, payment processing, shipping logistics, and more.
Critical to this mapping is understanding whether each system simply references the data or actively modifies it. Each data attribute must have one clear owner with sole authority to change it (what developers might call a bounded context). When this principle is violated, data quality suffers and decisions are based on incorrect information.
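One lightweight way to honor that rule is to record a single owning system per attribute and reject changes from anyone else. The sketch below is illustrative only—the system names and the `apply_update` helper are assumptions, not part of any real product:

```python
# Each attribute has exactly one system allowed to modify it.
ATTRIBUTE_OWNERS = {
    "customer.address": "crm",
    "order.status": "order_management",
    "payment.card_token": "payment_gateway",
}

def apply_update(attribute: str, value, requesting_system: str, record: dict) -> dict:
    """Apply a change only if the requesting system owns the attribute."""
    owner = ATTRIBUTE_OWNERS.get(attribute)
    if owner != requesting_system:
        raise PermissionError(
            f"{requesting_system} may read {attribute!r}, but only {owner} may change it"
        )
    record[attribute] = value
    return record

record = {"customer.address": "12 Elm St"}
apply_update("customer.address", "98 Oak Ave", "crm", record)         # allowed
# apply_update("customer.address", "1 Pine Rd", "shipping", record)   # raises PermissionError
```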
Understanding system interactions
Just as we used color coding to identify different utilities (yellow for gas lines, orange for communication lines), your data infrastructure needs clear dependency analysis and mapping. This documentation ensures you understand exactly which downstream systems will be affected when source data changes—and how those effects will manifest.
For example, if your customer address data changes, will it automatically update shipping profiles? Will it trigger a verification process? Without proper dependency mapping, changes to one system can cause unexpected breakdowns in others.
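A simple way to reason about this is to keep the dependency map as a graph and walk it whenever a source changes. The sketch below uses hypothetical system names just to show the idea:

```python
from collections import deque

# Which downstream systems read each piece of source data (names are illustrative).
DEPENDENCIES = {
    "customer.address": ["shipping_profiles", "tax_engine"],
    "shipping_profiles": ["carrier_labels", "delivery_estimates"],
    "tax_engine": [],
    "carrier_labels": [],
    "delivery_estimates": [],
}

def affected_by(change: str) -> list[str]:
    """Breadth-first walk: everything downstream of a changed attribute or system."""
    seen, queue, order = set(), deque(DEPENDENCIES.get(change, [])), []
    while queue:
        system = queue.popleft()
        if system in seen:
            continue
        seen.add(system)
        order.append(system)
        queue.extend(DEPENDENCIES.get(system, []))
    return order

print(affected_by("customer.address"))
# ['shipping_profiles', 'tax_engine', 'carrier_labels', 'delivery_estimates']
```

Asking "what is downstream of this?" before every change is the data equivalent of calling to have the utility lines marked before you dig.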
Read more: Rethinking data infrastructure: A guide to AI-ready systems
Choosing the right tools
For our backyard project, we rented a trencher to handle the heavy digging. But we quickly discovered it was too powerful for our soft, sticky soil—the machine kept getting stuck and spinning its wheels.
This mirrors a common mistake in data infrastructure projects: Organizations often invest in enterprise-grade solutions before understanding their actual needs. They purchase powerful systems with advanced capabilities when what they initially need is something more basic and agile.
Before introducing a sophisticated new capability like AI or machine learning, assess your current systems, people, and processes. Are they prepared to integrate with new technology? Have you completed the foundational work of data lineage, metadata documentation, and dependency mapping?
The inevitable cleanup phase
Despite its benefits, our trencher left quite a mess in its wake—displaced soil, disturbed landscaping, and two exhausted weekend warriors. Similarly, streamlining data flow through your organization will temporarily disrupt your operational landscape.
Implementation teams often find themselves overwhelmed as they approach the finish line, discovering there’s still significant work ahead. This is where careful planning and resource allocation become critical. While there are no shortcuts around sound data management principles, proper preparation can minimize disruption.
The manual labor phase
The next day brought the hardest part of our project: Manual digging around existing utility lines—a battle of sheer willpower and endurance.
Data projects involve similar grunt work. Data cleansing, conversion, and quality assurance require focused effort and attention to detail. This is where outside consultants and temporary resources often prove valuable. Even more critical are your internal subject matter experts who understand the nuances of legacy systems. These rare individuals can help you navigate potential pitfalls—secure their involvement early!
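To give a flavor of that grunt work, here's a minimal sketch of the sort of cleansing and quality rules involved; the field names and rules are assumptions for illustration rather than any standard:

```python
import re

def cleanse(row: dict) -> dict:
    """Normalize a raw legacy record before it enters the new system."""
    return {
        "customer_id": row["customer_id"].strip(),
        "email": row["email"].strip().lower(),
        "postal_code": re.sub(r"\s+", "", row["postal_code"]),
    }

def quality_issues(row: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the row passes."""
    issues = []
    if not row["customer_id"]:
        issues.append("missing customer_id")
    if "@" not in row["email"]:
        issues.append("malformed email")
    if not re.fullmatch(r"\d{5}(-\d{4})?", row["postal_code"]):
        issues.append("unexpected postal code format")
    return issues

raw = {"customer_id": " 1001 ", "email": " Pat@Example.COM ", "postal_code": "27601 "}
clean = cleanse(raw)
print(clean, quality_issues(clean))
# {'customer_id': '1001', 'email': 'pat@example.com', 'postal_code': '27601'} []
```

Multiply rules like these across millions of legacy records and you can see why this phase is the shovel work of a data project.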
During this phase, consistency becomes paramount. A program management office and daily dashboard can help ensure all teams remain synchronized and accountable.
Read more: Revisiting data quality in the age of AI and ChatGPT
Deploying the infrastructure
Finally, after all that preparation, we were ready for the satisfying part: Installing catch basins and connecting drain pipes to channel water away from the house.
In data architecture, these drain pipes represent your data flows, while catch basins function as data hubs—key junctions responsible for collecting upstream data and incorporating information from adjacent sources. The size and capacity of each hub reflect your technology choices: A data warehouse, operational data store, or analytics platform.
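Conceptually, a hub simply collects upstream feeds and enriches them with adjacent reference data before passing them along. The toy sketch below assumes made-up feed and product names—a real hub would be one of the platforms mentioned above:

```python
def collect(upstream_feeds: list[list[dict]]) -> list[dict]:
    """Gather records arriving from every upstream source."""
    return [record for feed in upstream_feeds for record in feed]

def enrich(records: list[dict], reference: dict) -> list[dict]:
    """Join in adjacent reference data (here, a product lookup keyed by SKU)."""
    return [{**r, "product_name": reference.get(r["sku"], "unknown")} for r in records]

pos_feed = [{"sku": "A12", "qty": 2}]
web_feed = [{"sku": "B07", "qty": 1}]
product_reference = {"A12": "Garden trowel", "B07": "Drain pipe, 4 in"}

hub_output = enrich(collect([pos_feed, web_feed]), product_reference)
print(hub_output)
# [{'sku': 'A12', 'qty': 2, 'product_name': 'Garden trowel'},
#  {'sku': 'B07', 'qty': 1, 'product_name': 'Drain pipe, 4 in'}]
```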
In our drainage project, gravity was the essential force moving water through the system. In your data infrastructure, accessibility and quality create a similar gravitational pull. When data is reliable and readily available, it naturally flows to those who need it most, reaching every corner of your organization.
Looking ahead
Just as our backyard project required careful planning and hard work to ensure proper drainage, building effective data infrastructure demands thoughtful preparation and execution. When done right, both create systems that operate smoothly with minimal intervention.
In our next article, we’ll explore the second key concept of data infrastructure: Promoting the adoption of trusted data through systems and processes designed specifically for that purpose. Until then, I’ll be nursing my aching muscles and admiring our new drainage system in action.
Additional resources
Overview topics
- What is Data Management? (CIO wiki)
- A brief history of data management (Dataversity)
- What is data management? (IBM)
Data storage
- Introduction to data lakes (Databricks)
- What is a data lake? A super-simple explanation for anyone (Forbes)
Ontology
- Open Knowledge Graph Lab (EDM Council)
More from We Love Open Source
- Rethinking data infrastructure: A guide to AI-ready systems
- Revisiting data quality in the age of AI and ChatGPT
- Demystifying external data as a service
- Optimized GraphQL data fetching strategies
- How to get involved with We Love Open Source