From data catalogs to data virtualization and enterprise knowledge graphs, we look at how Europe’s data leaders are facilitating data discovery in complex, multinational organizations
Even in a mid-sized company, building data infrastructure that meets everyone’s data discovery and usage needs can be a huge challenge. But imagine being the data leader for a vast multinational with dozens – or even hundreds – of companies under its umbrella.
That vision is a reality for Vipul Parmar, Global Head of Data Management at advertising giant WPP.
“Even doing simple things like understanding what types of data resources we have out there is very difficult,” he says. “It’s a huge challenge.”
“Normally, you work in an organization and you’d navigate your way through all the various departments and teams,” he continues. “This is that times 400 or so.”
This challenge isn’t unique to WPP. Virtually all enterprises that have grown through mergers and acquisitions will inherit multiple data architectures that weren’t built to fit together.
When the CEOs of these organizations decide to pursue a strategy built on synergies and economies of scale, it then falls on these enterprises’ data leaders to unravel everything.
“We have platforms all around the world and each of them is different,” says Manuel de Francisco Vera, Global Head of Analytics at eBay Classifieds Group. “eBay eCG is a company formed of many companies.”
“My role is first to make sure that we have a common data layer and foundations across all the markets,” he continues. “We make sure that we compare apples to apples.”
He adds: “My role is also to identify synergies. So, opportunities that liberate great work that one country is doing and make it global.”
Working out how best to store data so that different regional teams can find and use it easily is the million-euro question here. For a time, some thought the solution was to pool everything in an enterprise data lake. But thinking on this topic has evolved over the years.
The Rise of the Data Catalog
Some say the ‘era of big data’ ended on June 5, 2019. That was the date that Cloudera shares plummeted 43% following the announcement that Tom Reilly was stepping down as the big data platform’s CEO.
This event was seen by many as marking a shift in thinking about enterprise data management and storage. The age of trying to pool everything in a centralized data lake was over.
“Centralization doesn’t always work,” explains Parmar. “We know we can create these huge data lakes if we want to. But sometimes, they end up becoming data swamps.”
Today, a huge enterprise like WPP will typically have a distributed data architecture made up of multiple data stores, from file repositories to cloud databases, on-premises databases and beyond.
“There’s tons of these instances and there’s a multitude of them,” Parmar says. “We don’t want to change that. We want to influence people in moving to their own right architecture. But not necessarily a centralized one.”
Enterprise data leaders once dreamed of creating a ‘single source of truth’ for their data. But in this context, it makes sense to create a ‘single source of reference’ to unify these distributed data stores instead.
A data catalog is one way to create a single reference point for all an organization’s data. Data catalogs provide data professionals with an organized inventory of all the data assets available to them and use metadata to make all that information ‘searchable’.
“The data catalog is going to be an integral part of a lot of our working practices,” says Parmar. “It’s more important now than ever that we get a handle on all the data we have.”
Where Data Virtualization and Knowledge Graphs Fit In
Data catalogs discover, tag and classify data and map it to a business glossary so people know what the data means. They also contain metadata on data lineage and help to organize the data an enterprise is storing.
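To make the idea concrete, here is a minimal Python sketch of what a catalog holds. It is an illustration under stated assumptions, not how any commercial catalog product works: the `CatalogEntry` and `DataCatalog` classes, the asset names and the storage locations are all hypothetical. Each entry records where an asset lives, the business glossary term it maps to, its tags and its upstream sources, so the catalog holds metadata only and never the data itself.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data asset (hypothetical structure)."""
    name: str
    location: str        # where the data lives; the catalog stores no data
    glossary_term: str   # business glossary term the asset maps to
    tags: set = field(default_factory=set)
    upstream: list = field(default_factory=list)  # lineage: source asset names


class DataCatalog:
    """A toy 'single source of reference' over distributed data stores."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return names of assets whose name, glossary term or tags match."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.glossary_term.lower()
            or any(kw in t.lower() for t in e.tags)
        )

    def lineage(self, name):
        """Return every upstream asset of `name`, nearest first."""
        seen = []
        queue = list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen


# Hypothetical assets spread across three different kinds of store.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    "crm_raw", "s3://example-bucket/crm/", "Customer", {"pii", "raw"}))
catalog.register(CatalogEntry(
    "customers_clean", "postgres://example-host/dw.customers", "Customer",
    {"pii", "curated"}, ["crm_raw"]))
catalog.register(CatalogEntry(
    "campaign_spend", "bigquery://example-project.ads.spend", "Media Spend",
    {"finance"}, ["customers_clean"]))

print(catalog.search("customer"))
print(catalog.lineage("campaign_spend"))
```

Searching for "customer" surfaces both customer assets regardless of which store they sit in, and the lineage call traces the spend table back through the curated customer table to the raw CRM extract.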
These catalogs can be plugged into data preparation tools, data science workbenches and BI tools. But data leaders are also looking at how other tools can make a distributed data architecture easier to work with.
“When it comes to ‘ways of working’, we’re looking at virtualizing our data, so we can start using it without being impeded by the environments those data sit in,” Parmar explains.
Data virtualization tools are a type of ‘data fabric’ that can work with a data catalog to let users preview data from within the catalog, integrate data stored in different places and provision access to that data. Crucially, this can be done without moving or transforming data from their original sources.
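The principle can be sketched in a few lines of Python. This is a simplified illustration, not a real virtualization product: an in-memory SQLite database stands in for a relational store, a CSV string stands in for a file repository, and the table, column and function names are all hypothetical. The point is that the combined view is computed at query time, while each source stays where it is.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (in-memory SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE agencies (agency_id INTEGER, name TEXT)")
db.executemany(
    "INSERT INTO agencies VALUES (?, ?)",
    [(1, "Agency A"), (2, "Agency B")],
)

# Source 2: a file repository (a CSV string as a stand-in for a file).
CSV_FILE = "agency_id,spend\n1,100\n2,250\n1,50\n"


def read_agencies():
    """Read rows from the database on demand."""
    for agency_id, name in db.execute("SELECT agency_id, name FROM agencies"):
        yield {"agency_id": agency_id, "name": name}


def read_spend():
    """Read rows from the CSV source on demand."""
    for row in csv.DictReader(io.StringIO(CSV_FILE)):
        yield {"agency_id": int(row["agency_id"]), "spend": int(row["spend"])}


def spend_by_agency():
    """Virtual view: joins both sources at query time.

    Nothing is written to a central store; each call re-reads the sources.
    """
    names = {r["agency_id"]: r["name"] for r in read_agencies()}
    totals = {}
    for r in read_spend():
        name = names[r["agency_id"]]
        totals[name] = totals.get(name, 0) + r["spend"]
    return totals


print(spend_by_agency())
```

Because the view re-reads its sources on every call, a change in either store shows up in the next query without any copy or load step in between.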
Enterprise knowledge graphs are another emerging technology designed to make the data an enterprise stores easier to search and understand.
30% of organizations will use graph technologies to facilitate rapid data contextualization by 2023 (Source: Gartner, 2020)
Mapping data using a graph database can help enterprises replace traditional, keyword-based search tools with contextual ones that understand user intent and tailor their results accordingly. This is the same technology that Google uses to create the ‘knowledge panels’ that appear next to certain search results.
“For me, utilizing this type of technology was a no-brainer,” says David Meza, Senior Data Scientist at NASA. “It made it really easy for our end users to visualize that data, see connections, find out how the data was being impacted and to find information much more quickly.”
“We had been using a standard key list search but needed to examine the data relationships between different lessons learned over the last 50 years,” he adds. “Graph databases simplified that process for us by connecting lessons across those spectrums.”
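The difference between keyword search and graph search can be shown with a small Python sketch. The lessons and topics below are invented for illustration and do not come from NASA's system, and a plain dictionary stands in for a real graph database. A keyword search for "valves" would only return documents that mention valves; walking the graph also surfaces lessons connected through a shared topic.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples linking lessons to topics.
TRIPLES = [
    ("lesson_1", "about", "valves"),
    ("lesson_2", "about", "valves"),
    ("lesson_2", "about", "fuel systems"),
    ("lesson_3", "about", "fuel systems"),
    ("lesson_4", "about", "software"),
]

# Build an undirected adjacency map: node -> set of neighbouring nodes.
graph = defaultdict(set)
for subject, _predicate, obj in TRIPLES:
    graph[subject].add(obj)
    graph[obj].add(subject)


def related(start, max_hops):
    """Breadth-first search: nodes within `max_hops` edges of `start`."""
    seen = {start}
    frontier = {start}
    for _ in range(max_hops):
        # Expand one hop outwards, skipping nodes already visited.
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen - {start}


# Two hops: lessons that share a topic with lesson_1.
print(sorted(related("lesson_1", 2)))
# Four hops: also reaches lesson_3 via lesson_2's second topic.
print(sorted(related("lesson_1", 4)))
```

In this toy graph, `lesson_3` never mentions valves, yet it is reachable from `lesson_1` because `lesson_2` links the two topics. `lesson_4` shares no topic with the others and stays out of the results.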
How widespread these technologies will become remains to be seen. But it’s possible we will start to see more enterprises creating virtualized knowledge graphs to enable real-time data integration and intuitive data discovery.
Given that centralized data storage and management is no longer an option for Europe’s biggest enterprises, it looks like data catalogs, data fabrics and knowledge graphs are the future of enterprise data discovery.