Columnar storage, massively parallel processing, and in-memory analytics are helping enterprises deliver on high-performance analytics use cases
Relational database management systems (RDBMS) are still hugely popular for business analytics. Anyone who’s ever glanced at an Excel spreadsheet is familiar with the concept of storing data in rows and columns, and market leaders such as Oracle have earned reputations as trusted data storage providers.
But as enterprises advance down the path to analytics maturity, some are finding that the performance of relational databases and some cloud-based data stores is no longer suitable for their needs.
“We’re certainly very much on the edges of what our existing technology can do, in terms of size of data and the speed at which we want to get analytics to the right people,” says Premal Desai, Head of Data and AI at The Gym Group.
Today, it’s increasingly common for data-focused executives to explore alternative storage and compute options. These may include columnar storage, massively parallel processing or in-memory analytics.
“We’re certainly very much on the edges of what our existing technology can do” – Premal Desai, Head of Data and AI, The Gym Group
Columnar Storage at MoneySuperMarket
Traditional databases are optimized for storing data in rows, keeping all the values of a single record together. Columnar databases flip that layout 90 degrees, storing each column’s values together and separately from the others.

This design allows queries to read only the columns they need, rather than scanning entire rows and discarding the unneeded values once the data has been loaded into memory. The result is faster query times for analytical workloads.
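The difference between the two layouts can be sketched in a few lines of Python. The table and field names below are invented for illustration; real columnar engines add compression, vectorized scans and much more, but the core access pattern is the same.

```python
# Illustrative sketch (not any vendor's implementation): the same small
# table stored row-wise and column-wise, queried for a single column.

# Row-oriented layout: each record's fields are stored together.
rows = [
    {"user_id": 1, "page": "home", "duration_ms": 320},
    {"user_id": 2, "page": "search", "duration_ms": 180},
    {"user_id": 3, "page": "home", "duration_ms": 450},
]

# Column-oriented layout: each column's values are stored contiguously.
columns = {
    "user_id": [1, 2, 3],
    "page": ["home", "search", "home"],
    "duration_ms": [320, 180, 450],
}

# Row store: the query touches every field of every record,
# even though it only needs duration_ms.
total_row_store = sum(r["duration_ms"] for r in rows)

# Column store: the query reads exactly one column and nothing else.
total_col_store = sum(columns["duration_ms"])

assert total_row_store == total_col_store == 950
```

Both queries return the same answer; the column store simply reads far less data to get there, which is where the speed-up comes from at scale.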
Price comparison website MoneySuperMarket uses columnar storage to enable fast analysis of how customers are interacting with its site.
“We run a website, predominantly,” explains Harvinder Atwal, Chief Data Science Officer at MoneySuperMarket. “That’s where our data comes from.”
“We get data in real-time and stream it to our data lake, before it is transformed and loaded into the data warehouse,” he continues. “The data scientists and the data analysts can work with the data in the lake for near real-time analytics or the warehouse.”
He adds: “So, the starting point for us is real-time data being ingested into a data lake, but also the ability to effectively integrate and aggregate data from lots of different sources into that same data lake, rather than having it scattered across platforms.”
This approach allows Atwal’s teams to run queries and refresh their dashboards in a matter of seconds. He says this is an ideal configuration for the analytics use cases they’re currently delivering.
“I think most people, if they needed real-time analytics, would work directly on streaming data and do streaming analytics” – Harvinder Atwal, Chief Data Science Officer, MoneySuperMarket
Massively Parallel Processing and The Gym Group
Massively Parallel Processing (MPP) is another data infrastructure innovation that can supercharge the processing of analytics queries. It involves using multiple processors to run a program simultaneously, with each processor working on its own part of the job.
This kind of processing architecture is often employed in conjunction with columnar storage. Together, they allow databases to handle massive amounts of data and deliver even faster analytics on large datasets.
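The scatter-and-gather pattern behind MPP can be sketched with Python’s standard `multiprocessing` module: partition the data, let each worker aggregate its own partition independently, then combine the partial results. This is a minimal single-machine analogy; real MPP databases distribute the same idea across many nodes.

```python
# Minimal sketch of the MPP pattern: partition, aggregate in parallel, merge.
from multiprocessing import Pool

def partial_sum(partition):
    """Each worker aggregates only its own slice of the data."""
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))  # stand-in for a large fact table
    n_workers = 4
    chunk = len(data) // n_workers
    partitions = [data[i * chunk:(i + 1) * chunk] for i in range(n_workers)]

    # Scatter: each process computes a partial aggregate in parallel.
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, partitions)

    # Gather: merge the partial results into the final answer.
    total = sum(partials)
    assert total == sum(data)
    print(total)  # 499999500000
```

The key property is that each partition is processed without coordination, so adding workers (or nodes) scales the aggregation step almost linearly.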
Desai sees this technology playing a key role in the next phase of The Gym Group’s analytics journey.
“We’ve got 185 gyms around the country,” he says. “Knowing how we deal with pricing and knowing how demand adjusts to that pricing is something that is super important to our business.”
Desai adds: “Where we’re trying to go is, as well as the pricing and churn use cases, trying to understand member behavior at a granular level.”
For The Gym Group, there’s value in understanding how members interact and engage across multiple touch points – physical and digital. As the gym’s membership and interactions expand, the amount of compute power required to deliver tangible insights and modeling capability will increase exponentially.
Desai concludes: “That really takes us into the world of parallelized processing. As you can imagine, the processing power needed to be able to accomplish that is actually quite phenomenal.”
“Where we’re trying to go is, as well as the pricing and churn use cases, trying to understand member behavior at a granular level” – Premal Desai, Head of Data and AI, The Gym Group
Enhancing Business Analytics Performance with In-Memory Processing
Of course, there are some analytics use cases where even MPP won’t be fast enough. When use cases require real-time insights, need systems to upload large quantities of data very frequently or must support large numbers of users accessing a system at once, in-memory processing may be the answer.
In-memory processing is a technology that allows data to be analyzed entirely in a computer’s main memory (RAM). This means processing performance isn’t slowed by the latency incurred when data must be read from disk or fetched over a network.
The fastest data storage options will leverage columnar storage, MPP and in-memory processing in unison. And this is how non-profit healthcare provider Piedmont Healthcare transformed the performance of its self-service portal.
Mark Jackson, Head of BI at Piedmont Healthcare, recalls: “On the data prep side of things, our self-service data sources for hospital billing went from processing in six hours to four minutes. So, we went from having 14 months of data available to now nearly 10 years of data.”
In addition to harnessing the rapid compute speed that comes with in-memory processing, Piedmont Healthcare selected a solution that automatically tunes its database configuration in response to user behavior, removing the need for an administrator to adjust performance parameters manually.
Jackson continues: “On the consumer side of things, our infection prevention dashboard went from loading in one minute as a Tableau extract to 10 seconds, with double the amount of data.”
Through implementing an in-memory processing solution, Piedmont Healthcare increased the number of users its Tableau data visualization tool could handle simultaneously from 24 to 311.
In the process, the company was able to dramatically improve the quality of care it provides to its 2 million patients, achieving a 40% reduction in patient harm.
“We achieved zero harm for 27 metrics across our hospitals, with the help of the insights that we offer to our quality and processing bereavement team members” – Mark Jackson, Head of BI, Piedmont Healthcare
As enterprises scale their analytics ambitions, they must also scale their data infrastructure to support faster processing speeds on larger datasets.
For many companies, conventional databases or cloud-based data stores may be fast enough. But when enterprises need a seriously high-performance architecture to support large-scale data-driven business processes, it may be worth exploring an infrastructure that includes in-memory processing.
Desai concludes: “Where businesses really need to do massive amounts of parallel processing in near-real-time, that’s where you can see certain advantages of the in-memory play.”