First there was business intelligence. Then came big data. Then came analytics. Then came data warehousing.
Now comes “data lake.”
It’s a term that is just starting to gain traction and probably something that bank CIOs, at least, ought to become familiar with if they aren’t already. It’s a term that others in the C-suite ought to be aware of, since the concept offers considerable potential.
The problem right now is, it can get really technical. Hadoop technical. Schema technical. Hair-on-fire technical.
Why swim in Data Lake?
So it’s probably best first to describe some of the things that data lakes could do to make life better for bankers. Hortonworks, which provides Apache Hadoop services (more on this in a bit), offers a few examples:
• Screening new account applications for risk of default by storing and analyzing multiple data streams.
• Monetizing anonymous banking data in secondary markets by aggregating data from different lines of business in a bank while de-identifying, masking, and/or encrypting individuals’ information.
• Analyzing trading logs to detect money laundering by accelerating analytics in a shared data repository across multiple lines of business.
“Banks, insurance companies, and securities firms that store and process huge amounts of data in Apache Hadoop have better insight into both their risks and opportunities,” Hortonworks says.
Some questions to answer
Okay, first, what’s Apache Hadoop?
Briefly, Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. (The name “Hadoop” reportedly comes from a toy elephant that belonged to the inventor’s son.)
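For the curious, the idea behind "distributed processing" can be sketched in a few lines of Python. This is only an illustration of the split-then-combine pattern and does not use Hadoop itself. The transaction records, the `CHUNKS` list, and the function names are all made up for the example, and everything runs on one machine rather than a cluster.

```python
from collections import Counter
from functools import reduce

# Hypothetical transaction records, split into chunks as if each chunk
# were stored on a separate machine in a cluster.
CHUNKS = [
    ["deposit", "withdrawal", "deposit"],
    ["transfer", "deposit"],
    ["withdrawal", "transfer", "transfer"],
]


def map_chunk(chunk):
    """Map step: each 'machine' counts transaction types in its own chunk."""
    return Counter(chunk)


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two machines."""
    return left + right


# Each chunk is processed independently, then the partial results are merged.
partial_counts = [map_chunk(chunk) for chunk in CHUNKS]
totals = reduce(merge_counts, partial_counts, Counter())

for kind, count in sorted(totals.items()):
    print(kind, count)
```

Because each chunk is counted on its own, the work could in principle be spread across many inexpensive machines, which is the appeal of the commodity-hardware approach described above.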
Wait. It can get even more technical.
First, though, a helpful article in Database Trends and Applications gives this description of what a data lake is:
“The established method of aggregating data from multiple sources for business intelligence and analytics is a data warehouse. The issue though is that with the volume of data today, data warehouses are not scalable enough or agile enough to keep up. This is where the data lake comes in with Hadoop. Both are flexible, scalable, open sourced, and have performed well with large amounts of data, especially of different schemas.”
Things were starting to make sense there until that very last word: schemas.
What’s a “schema”?
PwC comes to the rescue. In a very enlightening article about data lakes in general it provides a simple definition of a schema: It’s a data model.
What does that mean?
Today, enterprises collect data in many different formats and from many different sources, the combination of which forms data models:
• Structured data comes in from formatted sources, such as account-opening documents.
• Unstructured data comes in from elsewhere, such as from bank customers’ social media posts.
• Other data comes through internet-of-things sources, such as cloud-connected ATMs.
All this data can be valuable, but it’s all in different formats. Companies typically dump all that information into a data warehouse and hope some data scientist can make sense of it all, and then put it to use for some business purpose.
In the worst case, the data warehouse becomes the data graveyard.
Back into the lake
Which is where a data lake takes over.
Once again, PwC begins to make things clearer:
“Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model. Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential of operational insight and data discovery.”
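What might "deferring modeling" look like in practice? Here is a minimal Python sketch, under stated assumptions. The records in `RAW_LAKE`, their field names, and the `read_with_schema` helper are all hypothetical, invented for illustration, and not taken from PwC or any product. The point is only that raw data is stored as it arrives and each user applies a model when reading it.

```python
import json

# Hypothetical raw records, stored exactly as they arrived, with no shared schema.
RAW_LAKE = [
    '{"source": "account_form", "customer_id": 101, "income": 52000}',
    '{"source": "social_post", "customer_id": 101, "text": "Love the new app"}',
    '{"source": "atm", "atm_id": "A7", "customer_id": 102, "withdrawal": 200}',
]


def read_with_schema(raw_records, fields):
    """Apply a schema at read time: keep records that contain every requested field."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if all(name in record for name in fields):
            rows.append({name: record[name] for name in fields})
    return rows


# Two users read the same raw data through two different models.
risk_view = read_with_schema(RAW_LAKE, ["customer_id", "income"])
atm_view = read_with_schema(RAW_LAKE, ["atm_id", "withdrawal"])

print(risk_view)
print(atm_view)
```

Neither view required reshaping the stored data, and a third user could define yet another view later without disturbing the first two.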
PwC makes another data-lake claim: Unlike data warehouses, where useful data can be covered up and even forgotten, the more data poured into a data lake and the more people who access it, the better.
Or, as PwC puts it:
“Sourcing new data into the lake can occur gradually and will not impact existing models. The lake starts with raw data, and it matures as more data flows in, as users and machines build up metadata, and as user adoption broadens. Ambiguous and competing terms eventually converge into a shared understanding (that is, semantics) within and across business domains. Data maturity results as a natural outgrowth of the ongoing user interaction and feedback at the metadata management layer—interaction that continually refines the lake and enhances discovery.”
Phew. Enough already.
Time to get your feet wet?
No doubt there will be more to come on data lakes, but this is a start.
Just know that some heavyweight industry organizations are starting to get into the data lake act. In addition to Hortonworks, EMC just announced its own product, “Federation Business Data Lake.”
Here’s what EMC has to say: “Business data lakes are becoming a top corporate priority because they fill a critical gap left by traditional data warehousing. A business data lake contains structured and unstructured data from a wide variety of sources and the analytics are focused on building models to predict the future. Companies with successful data lakes are leveraging the data and predictive models to build new products, applications, and business models to redefine their industry, taking or extending the ‘market leader’ role.”
Maybe it really is time to go jump in the lake.
Sources used for this article include: