Navigating the Transition from Data Warehouses to Data Mesh
Chapter 1: Understanding the Data Landscape
The gap between theoretical knowledge in data science and the practical challenges of working with data in large organizations is significant. When I began my analytics career at one of Australia’s major banks five years ago, I encountered a multifaceted data environment marked by various hurdles:
- Difficulties in locating, accessing, and utilizing data
- Conflicting business priorities creating disarray
- Legacy systems that were challenging to maintain and upgrade
- An entrenched culture that resisted data-driven insights
- Isolated teams lacking communication
Initially, I accepted this as the norm in enterprise data management, convinced that while technology would advance rapidly, user experience would eventually catch up. Despite my training in data science, applying that knowledge in a corporate setting proved to be far from simple. Online courses rarely equip one with the skills needed for real-world challenges.
What surprised me was discovering that my organization’s struggles with data were not unique; they were widespread across the industry. We find ourselves in a rapidly evolving technological landscape, where computing power is expanding, machine learning applications are becoming commonplace, and generative AI is transforming various sectors without pause, all while consumer expectations continually shift.
Everyone in the analytics field is seeking to gain a solid footing as we collectively navigate these changes. The mantra seems to be: “Fail fast and learn quickly.” This prompted me to write this article; I aim to share insights that will aid graduates, new business analysts, and self-taught data scientists in quickly grasping the enterprise-level data landscape and setting realistic expectations.
1. Data as the Core of Digital Strategy
To begin, we must acknowledge the pivotal role data plays in the contemporary competitive business environment. Across various sectors, companies are increasingly leaning towards data-driven decision-making. Simultaneously, consumers expect ultra-personalized digital experiences that utilize advanced analytics, such as AI and machine learning, built on high-quality data.
This demand enables conveniences like on-demand streaming of tailored TV shows, rapid food delivery, and instant mortgage approvals. Thus, having an advanced data stack is crucial for survival and success in the digital age. As Clive Humby famously remarked in 2006:
"Data is the new oil."
IT departments and data platforms are no longer relegated to the background; they are now integral to enterprise strategies. Data has become a primary asset, driving virtually every aspect of business operations. Now, let’s delve deeper into how data is structured, processed, and stored in large corporations.
At a high level, the data landscape can be divided into operational data and analytical data.
2. Operational Data
Operational data typically consists of individual records representing specific events, such as transactions or customer interactions. This data is vital for daily business operations and is stored in databases accessed by microservices, which are small, independently deployable services that each own and manage their own data.
This data is continuously updated and reflects the current status of the business. Transactional data, a crucial subset of operational data, includes:
- Money transfers between accounts
- Payments for goods and services
- Customer interactions across various channels, like online or in-branch
Data produced directly by applications is referred to as source data, or System-of-Record (SOR) data. Because it is unaltered, it is the preferred input for data scientists; it also forms the foundation of data lakes and marks the starting point of data lineage.
Online Transactional Processing (OLTP) systems are designed to process numerous transactions swiftly. They depend on databases that can efficiently store and retrieve data, ensuring accuracy through ACID principles:
- Atomicity: Each transaction is all-or-nothing; it either completes fully or leaves the data unchanged.
- Consistency: Every transaction moves the database from one valid state to another, so business rules are never violated.
- Isolation: Multiple transactions can occur concurrently without interference.
- Durability: Changes to data are preserved even in the event of a system shutdown.
OLTP systems are essential for critical business applications, handling deposits, withdrawals, and balance inquiries, among other functions.
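To make these guarantees concrete, here is a minimal sketch of an atomic money transfer in Python. SQLite stands in for a production OLTP database, and the accounts table and balances are hypothetical:

```python
import sqlite3

# SQLite is a stand-in here; a real OLTP system would be a server
# database such as PostgreSQL. Table and account names are hypothetical.
conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

def transfer(conn, src, dst, amount):
    # `with conn:` opens a transaction that commits on success and rolls
    # back on any exception, so a transfer is never half-applied (atomicity).
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))

transfer(conn, "alice", "bob", 25.0)
```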
3. Analytical Data
Analytical data provides a time-based, aggregated overview of a company's operational data, enabling insights into historical performance and facilitating data-driven future decisions. This type of data is commonly used to create dashboards and reports by data analysts and to train machine learning models by data scientists.
Businesses are increasingly adopting powerful business intelligence tools and no-code machine learning platforms to democratize data and analytics capabilities. The shift towards data democratization is significant, as many organizations now aim to empower a larger number of employees with data skills, enhancing productivity across the board.
Analytical processing contrasts with transactional processing: the former analyzes data in bulk, while the latter records individual events as they happen. Analytical systems typically use read-only stores holding extensive historical data, enabling analysis of point-in-time snapshots.
Connecting operational data with analytical data occurs through data pipelines, usually constructed by data engineers. These pipelines, often designed as ETL (Extract, Transform, Load) processes, involve extracting data from operational systems, transforming it for business needs, and loading it into data warehouses or lakes for analysis.
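As an illustration, here is a minimal ETL sketch in plain Python. The file names, columns, and aggregation are hypothetical assumptions; real pipelines typically run under an orchestration tool rather than as a single script:

```python
import csv
from collections import defaultdict

# Extract: read raw transaction records exported from an operational system.
with open("transactions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: aggregate transaction amounts per customer, a typical
# business-driven reshaping step.
totals = defaultdict(float)
for row in rows:
    totals[row["customer_id"]] += float(row["amount"])

# Load: write the aggregated result out for analysis; in practice this
# would be a bulk load into a warehouse or lake table.
with open("customer_totals.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "total_amount"])
    writer.writerows(sorted(totals.items()))
```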
4. Data Warehouses and Data Lakes
The analytical data landscape has evolved into two primary architectures: Data Warehouses and Data Lakes. Different users engage with data at various stages within enterprise architecture. Data analysts typically query data warehouses to create effective dashboards and reports, while data scientists often explore data lakes for their prototyping needs.
In terms of environments, it's crucial to understand two categories:
- Non-production: A space for experimentation where changes are low-cost and do not disrupt business operations.
- Production: The deployment environment for finalized applications and systems, requiring high security and reliability.
4.1 Data Warehouses
Data Warehouses are established systems for storing structured data in a relational schema optimized for read operations. Key features include:
- Historical Analysis: Quick querying of large volumes of historical data for descriptive analytics.
- Schema-on-Write: The schema is defined before any data is loaded, enforcing structure up front (see the sketch after this list).
- Data Modeling: Analysts create data models that pre-aggregate data for easier reporting.
- Fast Queries: Utilizing OLAP models for quick data retrieval and analysis.
4.2 Data Lakes
Data Lakes serve as the preferred method for storing large volumes of file-based data. They utilize distributed computing and storage to manage unstructured data efficiently. Notable characteristics include:
- Schema-on-Read: Schemas are defined only when data is read, offering flexibility for data scientists (illustrated in the sketch after this list).
- File Types: Support for various unstructured and semi-structured data formats, including text, audio, images, and more.
- Cloud Computing: Increasingly hosted on public cloud services, allowing for scalable resource allocation.
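To illustrate schema-on-read, the sketch below parses semi-structured JSON records and discovers their fields only at read time; the file name and fields are hypothetical:

```python
import json

# Schema-on-read: no structure is declared up front. Each record's fields
# are discovered as the file is parsed.
records = []
with open("events.jsonl") as f:
    for line in f:
        records.append(json.loads(line))

# Derive the "schema" after the fact: the union of fields seen across records.
fields = set()
for record in records:
    fields.update(record.keys())
print(sorted(fields))
```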
5. The Emergence of Data Mesh
Zhamak Dehghani, the architect who coined the term "data mesh," identifies three generations in the evolution of enterprise data:
- First Generation: Proprietary data warehouses and business intelligence platforms that have left companies with significant technical debt.
- Second Generation: Big data ecosystems that often resulted in cumbersome data lakes with limited real-world utility.
- Third Generation: Current architectures that incorporate streaming and cloud-based services for enhanced data management.
Data mesh represents a shift towards a decentralized model, where data ownership is distributed across teams that best understand the data. This approach fosters independent work, agility, and the creation of high-quality, reusable data products.
Data mesh encourages a collaborative culture where data is treated as a valuable asset, driving efficiency and reducing redundancy in data management processes.
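One way to make "data as a product" tangible is to give every data product an explicit, machine-readable contract owned by its domain team. A minimal sketch, where the descriptor fields are illustrative assumptions rather than any formal standard:

```python
from dataclasses import dataclass, field

# Hypothetical descriptor for a data product in a data mesh.
@dataclass
class DataProduct:
    name: str                  # e.g. "home-loan-applications"
    owner_team: str            # the domain team accountable for this data
    schema: dict               # column name -> type, published for consumers
    freshness_sla_hours: int   # how stale the data is allowed to become
    quality_checks: list = field(default_factory=list)

loans = DataProduct(
    name="home-loan-applications",
    owner_team="lending-domain",
    schema={"application_id": "str", "amount": "float", "status": "str"},
    freshness_sla_hours=24,
    quality_checks=["no_null_ids", "amount_positive"],
)
```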
6. Data Governance
Data governance involves establishing clear roles and responsibilities concerning data management within an organization. It encompasses:
- Data Privacy: Protecting sensitive information to maintain customer trust.
- Data Security: Safeguarding data against both internal and external threats.
- Data Quality: Ensuring high-quality data for reliable insights.
Effective data governance is crucial for maintaining data integrity and minimizing risks associated with data management.
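As a small illustration of the data-quality pillar, a pipeline can assert basic expectations before publishing data downstream. The checks and column names below are hypothetical:

```python
# Hypothetical data-quality gate: validate records before publication.
def validate(records):
    errors = []
    seen_ids = set()
    for i, r in enumerate(records):
        if not r.get("customer_id"):
            errors.append(f"row {i}: missing customer_id")
        elif r["customer_id"] in seen_ids:
            errors.append(f"row {i}: duplicate customer_id")
        else:
            seen_ids.add(r["customer_id"])
        if r.get("amount", 0) < 0:
            errors.append(f"row {i}: negative amount")
    return errors

issues = validate([{"customer_id": "c1", "amount": 10.0},
                   {"customer_id": "c1", "amount": -5.0}])
print(issues)  # flags the duplicate id and the negative amount
```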
7. Conclusion
Navigating the complexities of enterprise data architecture can be challenging, often resulting in accumulated technical debt. However, organizations are increasingly recognizing the need to decentralize data management and implement data mesh principles.
My experience at a major bank has shown me the evolution from traditional data warehouses to modern data lakes, and now towards a decentralized mesh architecture. As we work to untangle our data landscape, I am hopeful that this shift will lead to improved data quality and greater overall efficiency.
If you resonate with these experiences or have insights to share, feel free to connect with me on LinkedIn, Twitter, or YouTube.