What is a data mesh and how does it work?
A data mesh is a data ecosystem model organized around business domains. It is governed through self-service capabilities that enable cross-functional teams to manage, serve, and ultimately own the data in their domains. It can generate distinct data products that inform key business processes and decisions.
The three main components of a data mesh
1. Domain-oriented data ownership, with federated governance
In a data mesh architecture, the data lives mainly in the infrastructure of different domains, or subject areas, that correspond to distinct business concerns, such as sales and customer support. Each domain may have its own schema.
Within each of those domains, cross-functional teams that include product managers, developers, business analysts, and others work autonomously with their own data and share it with other domains as needed. These teams are experts in where their data is stored and how to load and transform it. They may connect multiple data sources to their section of the data mesh, in some cases using their own dedicated data lake or hub.
Each team can have its own physical data mesh infrastructure for managing its domain data. Yet co-location of multiple schemas can also be effective, especially for datasets from different domains that are frequently joined with each other: such joins perform better when the datasets are stored in the same database. Accordingly, a data mesh can be either a physical or a logical enterprise data architecture.
Even as ownership is divided by domain, federated governance keeps the mesh from becoming unmanageable. Shared standards for data interoperability and quality, reinforced by a DevOps culture, provide that governance.
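As a rough illustration of federated governance, the sketch below checks a domain's data product descriptor against a small set of enterprise-wide rules. The required fields, the allowed formats, and the descriptor layout are all assumptions made for this example; they are not part of any data mesh standard.

```python
# Hypothetical enterprise-wide standards; a real mesh would define its own.
REQUIRED_FIELDS = {"name", "domain", "owner", "schema", "format"}
ALLOWED_FORMATS = {"parquet", "csv", "json"}


def governance_violations(descriptor: dict) -> list[str]:
    """Return the global rules that a domain's data product descriptor breaks."""
    violations = [
        f"missing field: {name}"
        for name in sorted(REQUIRED_FIELDS - set(descriptor))
    ]
    fmt = descriptor.get("format")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        violations.append(f"unsupported format: {fmt}")
    return violations


# Each domain writes its own descriptor; the rules are shared by everyone.
sales_orders = {
    "name": "orders",
    "domain": "sales",
    "owner": "sales-data-team",
    "schema": {"order_id": "int", "amount": "float"},
    "format": "parquet",
}
support_tickets = {"name": "tickets", "domain": "support", "format": "xls"}

ok = governance_violations(sales_orders)
bad = governance_violations(support_tickets)
```

Here the sales descriptor passes, while the support descriptor is flagged for two missing fields and an unsupported format. The domains stay autonomous, but the same check applies to all of them.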
2. Product thinking about data sets
Because each business domain is its own separate unit, there is a risk that domain data becomes so fragmented that efficient collaboration across the enterprise is no longer possible. This is where the concept of product thinking, as applied to an enterprise’s data sets, makes a major difference in realizing the full value of a data mesh.
Each domain team should view its data assets as components of a data product whose “customers” are other users around the organization, such as developers or data scientists, who need easy and secure access to it. For example, an artificial intelligence (AI) data engineer may need analytical data from a program running within an electronic health records (EHR) system, to improve that software’s algorithms.
A data mesh can deliver this level of convenience across the enterprise through coherent data products. Every product should be:
- Discoverable: A data product goes into a data catalog that includes metadata about its ownership and contents. This setup helps users reliably find what they need.
- Addressable: Each discoverable product should also be uniquely identifiable so that it can then be addressed. Consistent standards for such programmatic access are essential in environments that include a wide variety of data formats, from CSVs to public cloud buckets.
- Trustworthy: Data mesh platforms are meant to hold domain data owners to service-level objectives that govern the trustworthiness of their data products. These products should not require the extensive data cleansing common in a more traditional, strictly centralized data architecture.
- Self-describing: A data product should have clear semantics, syntax, and database schemas for its intended data consumer. “How do I actually use this?” should rarely — if ever — be a question when working within a data mesh.
- Interoperable: The data products in a data mesh should be correlatable across domains. Joining them, for instance, should be straightforward and not hindered by differences in metadata fields or formats.
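The five properties above can be sketched as fields on a data product descriptor plus a small catalog. This is only an illustration: the class names, the `mesh://` address scheme, and the sample products are hypothetical and do not come from any particular data mesh platform.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    address: str              # addressable: unique identifier (hypothetical scheme)
    owner: str                # discoverable: who owns the product
    description: str          # discoverable: what the product contains
    schema: dict              # self-describing: column name -> type
    freshness_slo_hours: int  # trustworthy: published service-level objective
    join_keys: frozenset = field(default_factory=frozenset)  # interoperable


class Catalog:
    """In-memory stand-in for a data catalog."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        # Addresses must be unique so each product can be addressed directly.
        if product.address in self._products:
            raise ValueError(f"address already in use: {product.address}")
        self._products[product.address] = product

    def search(self, keyword):
        # Discovery by keyword over the product descriptions.
        keyword = keyword.lower()
        return [p.address for p in self._products.values()
                if keyword in p.description.lower()]


def shared_join_keys(a, b):
    """Keys on which two products can be correlated across domains."""
    return a.join_keys & b.join_keys


orders = DataProduct(
    "mesh://sales/orders", "sales-data-team", "Confirmed customer orders",
    {"customer_id": "int", "amount": "float"}, 24, frozenset({"customer_id"}),
)
tickets = DataProduct(
    "mesh://support/tickets", "support-data-team", "Customer support tickets",
    {"customer_id": "int", "subject": "str"}, 4, frozenset({"customer_id"}),
)

catalog = Catalog()
catalog.register(orders)
catalog.register(tickets)
```

A consumer could then call `catalog.search("orders")` to find the sales product, and `shared_join_keys(orders, tickets)` to confirm that the two products can be joined on `customer_id`.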
Think of a data mesh as the enterprise data management equivalent of a customs union, like that of the EU. Each country is its own autonomous entity but simultaneously adheres to certain overarching standards for the exchange of products and services with fellow members. In the same way, domain data teams work independently but also follow global “rules” for the characteristics of their respective data products.
3. Self-service via data infrastructure as a platform
The distributed model of a data mesh would seem to imply the presence of numerous duplicative data pipelines and storage infrastructures, one for every domain. This setup would create technical complications that impede quick and actionable insights. But you can avoid this with a domain-agnostic data infrastructure platform that offers the same level of self-service to every team within the enterprise.
Such a data platform hides underlying complexity and streamlines the processes of storing, processing, and serving data products. Amid current cloud trends and in the multi-cloud world that many enterprises now inhabit, a data mesh should provide:
- Ingestion of any distributed data source, in any format, with scalability in any dimension, e.g., in the data volume or complexity of a query, or in data schema sophistication.
- Choice of cloud, so that enterprises can use the cloud service providers whose analytics ecosystems most closely meet current performance and price requirements.
- Support for hybrid deployments that span on-premises resources and public cloud services.
- An open design that enables teams to use their own libraries, languages they already know (SQL, R, and so on), and well-documented APIs as they build their domain data products.
- Integrated AI and machine learning (ML) to shorten the timeline to advanced analytics from distributed data.
- Separation of compute and storage, to dynamically meet user demands without needing IT to intervene or wasting capacity.
- Easy controls for managing mixed workloads and meeting service-level agreements for multiple applications.
Why data mesh? How it compares to other data architectures
Overall, a data mesh enables increased agility for teams as they work in the cloud with an expanding range of data sources and innovation-centric projects.
Traditional data architectures were sufficient in a world with relatively few data sources and a narrow set of use cases across the business. But now, those centralized models can create bottlenecks for teams that need to move quickly from raw data sources to insights.
Imagine someone, like our aforementioned hypothetical AI data engineer working on EHR systems, needs to create a new data product to meet rapidly changing business requirements. They would likely get slowed down because they would not be able to change relatively small and distinct components for data ingestion and processing on their own — they’d have to get others involved and modify the entire data pipeline.
This scenario is why older data architectures are often described as “monolithic” — changing one part of it means changing all of it. In contrast, data mesh platforms are more like microservices architectures, with individually updatable components that can be worked on by multiple teams.
The flexibility and agility achievable through a data mesh are what separate it from other data architectures built exclusively on centralized data warehouses and data lakes.
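The microservices comparison can be made concrete with a toy pipeline whose steps are individually replaceable. This is a minimal sketch under assumed names: the step functions and the sample records are invented for illustration and do not describe any real EHR system.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


def ingest(_: List[Record]) -> List[Record]:
    # Stand-in for reading from a source system (made-up sample rows).
    return [
        {"patient_id": 1, "heart_rate": "72"},
        {"patient_id": 2, "heart_rate": ""},
    ]


def clean(records: List[Record]) -> List[Record]:
    # Original step: drop rows with an empty reading.
    return [r for r in records if r["heart_rate"] != ""]


def clean_v2(records: List[Record]) -> List[Record]:
    # Replacement step: keep every row, marking empty readings as None.
    return [{**r, "heart_rate": r["heart_rate"] or None} for r in records]


def run_pipeline(steps: Dict[str, Step]) -> List[Record]:
    # Steps run in insertion order; each one receives the previous output.
    records: List[Record] = []
    for step in steps.values():
        records = step(records)
    return records


pipeline: Dict[str, Step] = {"ingest": ingest, "clean": clean}
before = run_pipeline(pipeline)

# Swap a single component; the ingest step is left untouched.
pipeline["clean"] = clean_v2
after = run_pipeline(pipeline)
```

Only the `clean` entry changes between the two runs. In a monolithic pipeline, the equivalent change would mean modifying and redeploying the whole flow.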
Data warehouse vs. data lake vs. data lakehouse vs. data mesh
These four data design patterns aren’t mutually exclusive; they may co-exist in an enterprise, for instance when a cross-functional domain team has its own data lake. However, there is a traceable evolution from data warehouse to data lake to data mesh, driven by the need to overcome certain architectural limitations.
Data warehouse
- What it is: A subject-oriented data architecture that integrates detailed data in a consistent way while maintaining a nonvolatile history of it.
- Benefits: Generates actionable insights (e.g., in dashboards) from huge amounts of curated data, including the creation of predictive analytics and dashboards that drive operational actions. It aggregates data from all enterprise sources in a central location with consistent governance and supports sandboxes for new idea testing.
- Limitations: Not ideal for big data use cases that require the storage and extraction of value from large amounts of raw data, such as that created by IoT devices and web and mobile sources.
Data lake
- What it is: A set of long-term data containers for managing and refining raw data, using low-cost object storage often delivered from the cloud.
- Benefits: Captures previously discarded “dark data” to drive innovation later on and stores data as-is without having to structure it first. The lake also allows insights to be efficiently captured by AI and machine learning services analyzing raw information.
- Limitations: Relatively few off-the-shelf tools are available for data lakes, which necessitates significant experience with open-source software. There’s also a high risk of silos due to limited governance, and there can be great difficulty balancing issues between security and ease of access.
Data lakehouse
- What it is: A combination of a data warehouse and a data lake.
- Benefits: Enables an enterprise to systematically extract insights in the mode of a data warehouse (via SQL, machine learning, or any other process) while taking advantage of the vast scale and low costs of a data lake.
- Limitations: Limited agility in adding new features because everything is centralized and monolithic. Data engineers end up spending a lot of their time cleaning up data from teams that have limited incentive to ensure their information is accurate as it goes in.
Data mesh
- What it is: A domain-driven data design pattern divided either logically or physically among the teams working in those domains.
- Benefits: Data mesh allows for autonomous active management of data by the teams closest to it and permits increased agility because there’s no central bottleneck. Each team can create its own data products.
- Limitations: It’s a relatively new architecture that enterprises are still working out. Performance and governance may suffer because users need to go over the network every time to access different data. Without cross-domain governance and semantic linking of data, it can become very siloed and yield disappointing results.
Three reasons data mesh may be the data architecture of the future
Even with its early limitations, data mesh could be the data architecture of the future, for three main reasons:
1. Increased agility and superior organizational scaling
Data mesh empowers teams to access and use data on their own terms, without having to go through the bottleneck of a single, central enterprise-wide data warehouse or data lake. They can use their own warehouses and lakes as nodes within the data mesh, load and query their domain data, and create data products faster.
Data engineers no longer bear the burden of sorting through all of the disparate information that gets dumped into a central data warehouse or lake, because data is being managed in numerous smaller domains instead. As a result, everyone in the organization can more rapidly respond to change and scale their workloads as necessary using a self-service data infrastructure platform.
2. Clear data ownership and accountability
Before the data mesh emerged, ownership of enterprise data was often unclear or even contested. Operational teams in different domains sent their data to a centralized location, where it was handled by specialized data engineers who were siloed from the rest of the organization.
These engineers faced the difficult task of working with data from domains in which they were not necessarily experts. They also served as intermediaries between domain teams working on the same project, working to create datasets that were consumable for all of them.
In a data mesh, ownership is clear-cut, due to domain-driven design. Teams can follow a serve and pull approach — rather than the traditional push and ingest method described above — where different teams work in the domains they know, make data products available across the enterprise, and access the products of other teams as needed.
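The serve and pull approach might look like the following minimal sketch, in which the owning domain registers a provider for its data product and consumers fetch the data on demand. The `Mesh` class, the `sales/orders` address, and the sample rows are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

Provider = Callable[[], List[dict]]


class Mesh:
    """Minimal registry: domains serve products, consumers pull them on demand."""

    def __init__(self) -> None:
        self._providers: Dict[str, Provider] = {}

    def serve(self, address: str, provider: Provider) -> None:
        # The owning domain publishes how to obtain its product.
        self._providers[address] = provider

    def pull(self, address: str) -> List[dict]:
        # The consumer asks the owning domain for the data when it needs it.
        return self._providers[address]()


mesh = Mesh()

# The sales domain keeps its data where it lives and serves a snapshot on request.
_orders = [{"order_id": 1, "amount": 120.0}]
mesh.serve("sales/orders", lambda: list(_orders))

first = mesh.pull("sales/orders")
_orders.append({"order_id": 2, "amount": 35.5})
second = mesh.pull("sales/orders")
```

Nothing is pushed to a central store. The second pull reflects the sales team's latest data because the product is read from its owner each time, while the first result remains the snapshot taken when it was pulled.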
3. Improved data quality and a DevOps-aligned culture
Because data ownership is obvious in a data mesh, teams have more incentive to ensure the quality of their data products before they distribute them. Quality is further enhanced by the close connection of the data mesh concept with the fundamentals of DevOps.
DevOps emphasizes collaboration through cross-functional teams along with continuous monitoring and refinement of products. When DevOps principles (like breaking work down into smaller, more manageable portions and creating a shared product vision) are applied in a data mesh, the different components of the data architecture are easier to use, iterate on, and maintain.
Higher-quality data products can then be delivered faster than before. Just as DevOps is a cultural movement as much as it is a technical one, a data mesh requires the right culture — one that emphasizes accountability and collaboration — to make its technologies benefit the business. DevOps itself helps enable that cultural change.
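One way a domain team might check quality before distributing a data product is an automated gate on freshness and completeness. The sketch below is an assumption-laden example: the function name, the 24-hour freshness limit, and the 95% completeness threshold are illustrative defaults, not recommended values.

```python
from datetime import datetime, timedelta, timezone


def meets_slo(records, required_fields, last_updated, now,
              max_age=timedelta(hours=24), min_completeness=0.95):
    """Check freshness and completeness before a product is published."""
    if not records:
        return False
    fresh = (now - last_updated) <= max_age
    # A record counts as complete when every required field has a value.
    complete = sum(
        all(r.get(name) is not None for name in required_fields)
        for r in records
    )
    return fresh and complete / len(records) >= min_completeness


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
stale = datetime(2023, 12, 30, 0, 0, tzinfo=timezone.utc)

good_rows = [{"id": i, "amount": 10.0} for i in range(4)]
gappy_rows = good_rows[:3] + [{"id": 3, "amount": None}]

publishable = meets_slo(good_rows, ["id", "amount"], recent, now)
too_old = meets_slo(good_rows, ["id", "amount"], stale, now)
too_sparse = meets_slo(gappy_rows, ["id", "amount"], recent, now)
```

Passing `now` explicitly keeps the check deterministic and easy to test. A team could run such a gate in its delivery pipeline so that stale or incomplete products are not published.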
Constructing a data mesh: Key considerations before starting
Before going all in on data mesh, enterprises should weigh a few key considerations:
Size and business requirements
A data mesh is ideal for larger organizations with numerous sources and domains, where there is potential friction between teams regarding who owns what.
If an organization does opt for a data mesh, then the distribution of domains should be closely aligned with actual business initiatives, such as the creation of an omnichannel customer experience or supply chain optimization. Such alignment creates clearer objectives for domain data teams, and ensures the data mesh delivers real business value, rather than being a mere experiment.
Data management and governance expertise
Although each domain team owns its data, that doesn’t mean there’s no need for enterprise-wide coordination and governance. Modern tools make it easier for people to get started with complex workloads, but the selection and implementation of those tools still require thorough oversight from experts.
Data management experts are also useful in guiding each team through the development of its processes and products. Working out these issues early on, with experienced guidance, saves the entire company the time and expense of doing so later.
Schema co-location and performance
Each domain should have a separate data schema, to remove the bottlenecks that arise from working with one schema for all data. In some scenarios, schemas should be co-located and connected for performance reasons. At the same time, data integration across all of the domains within a data mesh remains critical. Deliberate data placement strategies of this kind let your organization align performance with business priorities.
These steps provide the optimal combination of speed and cost for workloads that are highly complex, frequently joined with other data sets, and regularly reused — as long as a high-performance data fabric is in place.
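To illustrate co-located schemas, the sketch below uses SQLite from Python's standard library: each domain gets its own named schema attached to a single connection, and a join runs across both without moving data over a network. SQLite is only a stand-in for whatever database engine an enterprise actually uses, and the schema, table, and column names are made up for this example.

```python
import sqlite3

# One connection hosts two domain schemas (hypothetical names).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS support")

# Each domain keeps its own tables under its own schema.
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE support.tickets (customer_id INTEGER, subject TEXT)")
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)",
    [(1, 120.0), (2, 35.5), (1, 20.0)],
)
conn.executemany(
    "INSERT INTO support.tickets VALUES (?, ?)",
    [(1, "late delivery"), (3, "refund")],
)
conn.commit()


def tickets_with_spend(connection):
    """Join the support and sales schemas within one engine."""
    query = """
        SELECT t.customer_id, t.subject, SUM(o.amount) AS total_spend
        FROM support.tickets AS t
        JOIN sales.orders AS o ON o.customer_id = t.customer_id
        GROUP BY t.customer_id, t.subject
        ORDER BY t.customer_id
    """
    return connection.execute(query).fetchall()


rows = tickets_with_spend(conn)
```

Ownership stays separate, since each team manages only its own schema, but frequently joined data sits side by side so the join is resolved inside one engine.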
Looking ahead at data mesh’s prospects
Although distributed data ownership is not itself a novel concept, the specific approach to it that data mesh entails is new enough that real-world implementations of it are still rare.
However, many organizations are already evolving their design patterns and cloud solutions to accelerate data model development and better serve customers in ways that closely resemble the impact of a data mesh.