Go jump in Data Lake!

Big data gets bigger—but more manageable

04/04/2015 - 01:31 | Written by John Ginovsky

Bank tech trends can make your head spin. So each week longtime Tech Exchange Editor John Ginovsky does his best to “make sense of it all.”

First there was business intelligence. Then came big data. Then came analytics. Then came data warehousing.

Now comes “data lake.”

It’s a term that is just starting to gain traction and probably something that bank CIOs, at least, ought to become familiar with if they aren’t already. It’s a term that others in the C-suite ought to be aware of, since the concept offers considerable potential.

The problem right now is, it can get really technical. Hadoop technical. Schema technical. Hair-on-fire technical.

Why swim in Data Lake?

So it’s probably best first to describe some of the things that data lakes could do to make life better for bankers. Hortonworks, which provides Apache Hadoop services (more on this in a bit), offers a few examples:

• Screening new account applications for risk of default by storing and analyzing multiple data streams.

• Monetizing anonymous banking data in secondary markets by aggregating data from different lines of business in a bank while de-identifying, masking, and/or encrypting individuals’ information.

• Analyzing trading logs to detect money laundering by accelerating analytics in a shared data repository across multiple lines of business.

“Banks, insurance companies, and securities firms that store and process huge amounts of data in Apache Hadoop have better insight into both their risks and opportunities,” Hortonworks says.
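To make the second example a little less abstract, here is a minimal Python sketch of what “de-identifying” and “masking” could look like before records from different lines of business are pooled. It is only an illustration under stated assumptions: the field names, sample rows, and secret key are invented, and nothing here is taken from Hortonworks’ own tooling.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would live in a key vault, not in code.
SECRET_KEY = b"example-key-not-for-production"

def pseudonymize(customer_id: str) -> str:
    """Replace a customer ID with a keyed hash so records can still be joined."""
    digest = hmac.new(SECRET_KEY, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def mask_account(account_number: str) -> str:
    """Keep only the last four digits of an account number."""
    return "*" * (len(account_number) - 4) + account_number[-4:]

def de_identify(record: dict) -> dict:
    """Return a new record with direct identifiers dropped, hashed, or masked."""
    return {
        "customer_key": pseudonymize(record["customer_id"]),
        "account": mask_account(record["account_number"]),
        "line_of_business": record["line_of_business"],
        "balance": record["balance"],
    }

# Invented sample rows from two lines of business.
records = [
    {"customer_id": "C-1001", "name": "Pat Example", "account_number": "1234567890",
     "line_of_business": "mortgage", "balance": 250000.0},
    {"customer_id": "C-1001", "name": "Pat Example", "account_number": "9876543210",
     "line_of_business": "cards", "balance": 1200.5},
]

shared = [de_identify(r) for r in records]
```

The customer’s name never reaches the shared copy, yet both rows carry the same `customer_key`, so an analyst can still tell they belong to one (anonymous) customer.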

Some questions to answer

Okay, first, what’s Apache Hadoop?

Briefly, Apache Hadoop is a set of algorithms (an open-source software framework written in Java) for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. (The term “Hadoop” reportedly was coined by the inventor after the name of his son’s toy elephant.)

Yeah, sure.
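For readers who want a peek under the hood, the “distributed processing” part usually refers to Hadoop’s MapReduce model: split the data into chunks, work on each chunk separately, then combine the results. What follows is a toy, single-machine sketch of that idea in Python. It is not Hadoop code, and the branch names and transaction lines are invented for illustration.

```python
from collections import defaultdict

# Invented transaction lines; on a real cluster each chunk would sit on a different machine.
chunks = [
    ["branch_a,deposit,100", "branch_b,withdrawal,40"],
    ["branch_a,deposit,250", "branch_b,deposit,60", "branch_a,withdrawal,30"],
]

def map_chunk(lines):
    """Map step: emit a (branch, 1) pair for every transaction line in one chunk."""
    return [(line.split(",")[0], 1) for line in lines]

def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk is mapped independently, then the results are combined.
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
counts = reduce_counts(shuffle(mapped))
```

Because each chunk is mapped on its own, the work can in principle be spread across as many commodity machines as there are chunks.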

Wait. It can get even more technical.

First, though, a helpful article in Database Trends and Applications gives this description of what a data lake is:

“The established method of aggregating data from multiple sources for business intelligence and analytics is a data warehouse. The issue though is that with the volume of data today, data warehouses are not scalable enough or agile enough to keep up. This is where the data lake comes in with Hadoop. Both are flexible, scalable, open sourced, and have performed well with large amounts of data, especially of different schemas.”

Whoa!

Things were starting to make sense there until that very last word: schemas.

What’s a “schema”?

PwC comes to the rescue. In a very enlightening article about data lakes in general it provides a simple definition of a schema: It’s a data model.

What does that mean?

Today, enterprises collect data in many different formats and from many different sources, the combination of which forms data models:

• Structured data comes in from formatted sources, such as account-opening documents.

• Unstructured data comes in from elsewhere, such as from bank customers’ social media posts.

• Other data comes through internet-of-things sources, such as cloud-connected ATMs.

All this data can be valuable, but it’s all in different formats. Companies typically dump all that information into a data warehouse and hope some data scientist can make sense of it all, never mind put it to some business use.

In the worst case, the data warehouse becomes the data graveyard.

Back into the lake

Which is where a data lake takes over.

Once again, PwC begins to make things clearer:

“Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model. Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential of operational insight and data discovery.”
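One way to picture “deferring modeling” is what practitioners often call schema-on-read: store the raw records as they arrive and apply a data model only when someone asks a question. The Python sketch below is a simplified illustration of that idea, not PwC’s method; the record layouts and the helper name `read_with_schema` are assumptions made up for this example.

```python
import json

# Raw records land in the "lake" untouched, whatever their shape (all samples invented).
raw_lake = [
    '{"source": "account_opening", "customer": "C-1001", "income": 85000}',
    '{"source": "social_media", "customer": "C-1001", "text": "Love the new app!"}',
    '{"source": "atm", "machine": "ATM-17", "customer": "C-2002", "amount": 200}',
]

def read_with_schema(lake, source, fields):
    """Apply a data model at read time: pick one source and only the requested fields."""
    rows = []
    for line in lake:
        record = json.loads(line)
        if record.get("source") == source:
            rows.append({field: record.get(field) for field in fields})
    return rows

# Two users impose two different models on the same raw data.
risk_view = read_with_schema(raw_lake, "account_opening", ["customer", "income"])
atm_view = read_with_schema(raw_lake, "atm", ["machine", "amount"])
```

Neither view required reshaping the stored data, and adding a new source later would not disturb the views that already exist.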

PwC makes another data-lake claim: Unlike data warehouses, where useful data can be covered up and even forgotten, the more data poured into a data lake and the more people who access it, the better.

Or, as PwC puts it:

“Sourcing new data into the lake can occur gradually and will not impact existing models. The lake starts with raw data, and it matures as more data flows in, as users and machines build up metadata, and as user adoption broadens. Ambiguous and competing terms eventually converge into a shared understanding (that is, semantics) within and across business domains. Data maturity results as a natural outgrowth of the ongoing user interaction and feedback at the metadata management layer—interaction that continually refines the lake and enhances discovery.”

Phew. Enough already.

Time to get your feet wet?

No doubt there will be more to come on data lakes, but this is a start.

Just know that some heavyweight industry organizations are starting to get into the data lake act. In addition to Hortonworks, EMC just announced its own product, “Federation Business Data Lake.”

Here’s what EMC has to say: “Business data lakes are becoming a top corporate priority because they fill a critical gap left by traditional data warehousing. A business data lake contains structured and unstructured data from a wide variety of sources and the analytics are focused on building models to predict the future. Companies with successful data lakes are leveraging the data and predictive models to build new products, applications, and business models to redefine their industry, taking or extending the ‘market leader’ role.”

Maybe it really is time to go jump in the lake.

Sources used for this article include:

New Federation Business Data Lake Solution Paves Way For Big Data To Disrupt Every Industry Around The Globe

Banks, Insurance & Investment Firms Mitigate Risk While Creating Opportunity

Data Lakes And The Promise Of Unsiloed Data

Weighing Pros And Cons Of Data Lake Approach


John Ginovsky

John Ginovsky is a contributing editor of Banking Exchange and editor of the publication’s Tech Exchange e-newsletter. For more than two decades he’s written about the commercial banking industry, specializing in its technological side and how it relates to the actual business of banking. In addition to his weekly blog, “Making Sense of It All,” he contributes fresh, original stories to each Tech Exchange issue based on personal interviews or exclusive contributed pieces. He previously was senior editor for Community Banker magazine (which merged into ABA Banking Journal) and for ABA Banking Journal and was managing editor and staff reporter for ABA’s Bankers News.
