Draining the Data Lake

Jeremiah Hansen
6 min read · Mar 6, 2021

(Photo: Lake Berryessa in California, by Doug Letterman/Flickr)

Introduction

In this blog post I will try to make the case that we should both move beyond the file-based data lake and stop using the terms “data lake” and “data lakehouse” altogether!

Data lakes were created a little over 10 years ago to overcome real limitations in legacy data warehouses. The result has been the introduction of yet another data silo in the enterprise. They were needed at the time, but with new, modern, cloud-first approaches to data management like Snowflake, the need for disparate file-based data lakes has been eliminated.

Yet most data architects today would still recommend solutions which involve siloed file-based data lakes. To be fair, a file-based data lake is still required for data engineers building a solution today without Snowflake. Yet even data architects working with Snowflake continue to recommend file-based data lakes in their designs. Why? Snowflake has dramatically changed the data landscape and we all need to start thinking differently about the purpose and continued need for file-based data lakes.

First, can we please stop using the term “data lake”?

I’ve never liked the term “data lake” and think that it’s time we retire it altogether. The term has taken on a negative connotation by introducing an unnatural and artificial boundary in the enterprise data landscape, and beyond that, the term itself has always bugged me.

The term was coined a little over ten years ago, in October 2010, by James Dixon in his blog post titled “Pentaho, Hadoop, and Data Lakes”:

“If you think of a datamart as a store of bottled water — cleansed and packaged and structured for easy consumption — the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

I understand the analogy here, but for me it’s difficult to see how comparing data to a lake helps someone better understand their data. What comes to mind when you think of a lake? Lakes can be beautiful and fun to spend time on. I think of swimming, canoeing, and water skiing. But lake water is often murky, unsafe to drink, and sometimes even unsafe to swim in. In what helpful ways are lakes and data related?

And now a new twist on the phrase, the “data lakehouse”, is gaining popularity. When will it end? Please note that while I really do dislike the terms “data lake” and “data lakehouse”, my comments above are intended to be a bit tongue in cheek.

The real reason I believe that we should stop using the terms “data lake” and “data lakehouse” is that the term “data lake” is today synonymous with a file-based data store. And as I will argue below, managing a file-based data store is a huge pain and is no longer necessary with Snowflake.

The original purpose for data lakes

Data lakes were created to overcome real limitations in legacy data warehouses. The data landscape changed significantly with the advent of “big data”. And those changes are often grouped into three categories, called the “3 Vs” of big data. Here is a brief summary of the changes:

  • Volume: Need to store incredibly large amounts of data, for the lowest possible cost. But legacy systems either can’t handle the size or it’s not affordable to do so.
  • Variety: Need to store data in many different native formats (structured, semi-structured, unstructured), and to allow for schema-on-read semantics. But legacy systems either can’t store semi-structured data natively or can’t query it performantly. And it’s costly and time consuming to structure the data on ingest.
  • Velocity: Need to ingest both batch and streaming data. But legacy systems can’t handle streaming (near real-time) data effectively, especially given that it is often in a semi-structured format.
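
To make “schema-on-read” from the Variety point concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample events, the field names, and the helper `read_with_schema` are all hypothetical and not taken from any particular product. The raw records are stored exactly as they arrived, and a schema (a list of columns) is applied only when the data is read.

```python
import json

# Hypothetical raw events, stored as they arrived: one JSON document per line,
# each with a different set of fields (semi-structured data).
RAW_EVENTS = """\
{"user": "ana", "action": "login", "ts": "2021-03-01T08:00:00"}
{"user": "ben", "action": "purchase", "amount": 19.99}
{"user": "cy", "device": {"os": "ios"}}
"""


def read_with_schema(raw_text, columns):
    """Apply a schema at read time: keep the requested columns, fill gaps with None."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        rows.append({col: record.get(col) for col in columns})
    return rows


if __name__ == "__main__":
    # Two readers can project the same raw data onto different schemas.
    for row in read_with_schema(RAW_EVENTS, ["user", "action", "amount"]):
        print(row)
```

Nothing had to be decided about columns or types when the events were stored, which is the appeal. The cost is that every reader has to make those decisions again.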

Because expensive legacy data warehouses weren’t able to deal with the explosion of data along those three lines, data lakes were introduced to address that. But data lakes didn’t replace the functionality of legacy data warehouses. So the result was that a new, separate, siloed system was introduced. Here’s what the landscape has looked like since data lakes were introduced:

The problem with data lakes

The primary problem with data lakes is that they introduced a disparate data silo into the data landscape (see previous diagram). This has resulted in the following major challenges:

  • Working with FILES!! (almost every current data lake is file-based)
  • Very complex to manage and use (multiple tools, languages, file formats, etc.)
  • Proliferates data siloing across the organization (difficult to integrate raw data with modeled data in DW/DMs)
  • Slow performance analyzing data (need to convert to different file formats for performance)
  • Separate platform/compute environment to support (with multiple programming/query languages)
  • Difficult to administer and govern securely (uniform access control across tools, security at the table/row level)

To make matters worse, data lakes were implemented as a series of files in a file system (originally HDFS). Even today, “modern” data lakes are still centered around files, even though they’ve been moved to cloud storage. In fact, almost every data professional today would take it for granted that a data lake means files. Here is the same diagram as above, drawn slightly differently:

But managing all of your data as files represents a huge step backwards in my opinion. Anyone who has ever had to work with data files should recognize how terrible the task is. While some tools have evolved to make the process better, managing files is terrible for many reasons, including:

  • Lack of metadata (column names, column types, comments, etc.)
  • Enforcing file security at the table or row level for a table made up of many different files
  • Entirely dependent on naming conventions (that are different everywhere)
  • Slow performance querying files (especially plain text, need multiple formats depending on analytical use case)
  • Poor performance joining many different data sets made up of many different files
  • Different types of files (full, incremental, changes)
  • Different file formats (uncompressed/compressed, plain text/Parquet/Avro/ORC)
  • Many different CSV formats and file encodings (tool dependent, not obvious, always a challenge)
  • Different types of XML/JSON files (1 object per file, many objects per file)
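
As a small illustration of the metadata and CSV-format points above, here is a Python sketch using only the standard library `csv` module. The file contents and the `parse` helper are hypothetical. The same text parses without any error under two different delimiter assumptions, yields different rows each time, and in both cases every value comes back as an untyped string.

```python
import csv
import io

# Hypothetical file contents: semicolon-delimited, decimal commas,
# and nothing in the file that says so.
RAW = "id;name;amount\n1;Widget, large;12,50\n2;Gadget;7,00\n"


def parse(text, delimiter):
    """Parse CSV text with the given delimiter and return a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The reader has to guess the delimiter; both guesses "work".
as_semicolon = parse(RAW, ";")
as_comma = parse(RAW, ",")

if __name__ == "__main__":
    print("assuming ';':", as_semicolon[1])
    print("assuming ',':", as_comma[1])
```

Only the first guess gives three columns per row, and even then the amount is the string `12,50`, not a number. The file itself carries no column types, delimiter, or encoding, so that knowledge has to live in naming conventions or in each consumer's code.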

Draining the data lake

So what should we replace file-based data lakes with? In a previous blog post, Beyond “Modern” Data Architecture, I discussed the need to move beyond the current systems-based way of thinking. With Snowflake, the need to maintain separate file-based data lakes (and data marts) has been eliminated. Instead of thinking in terms of systems, we need to start thinking about data as it actually is to the business, in terms of data zones (or groups):

The data lake was originally needed to manage your Raw and Conformed data, but doing that required a separate file-based data system. And in fact, with most cloud data warehouses today, the file-based data lake is still required. However, with Snowflake you can manage all of your enterprise data in one platform.

Most of this discussion applies to designing new data solutions from the ground up, but what if you already have a large investment in a legacy file-based data lake? Never fear, the Snowflake cloud data platform can integrate with your file-based data lake and simplify working with it. You can ingest the data directly from your existing file-based data lake to take full advantage of all the features of Snowflake, or leave it as-is and work with it from Snowflake. With features like Snowpipe auto-ingest, external tables (with materialized views), Hive metastore integration, partitioned unload, and others, Snowflake can support all of your file-based data lake workloads.

So use Snowflake and let’s drain those file-based data lakes! And please, please can we stop using the terms “data lake” and “data lakehouse”?

P.S. If you’ve read this far and are thinking to yourself that I missed another solution to the problems discussed here, namely the data lakehouse, then stay tuned. That’s the exact subject of my next blog post!

Written by Jeremiah Hansen

I’m currently a Field CTO Principal Architect at Snowflake. Opinions expressed are solely my own and do not represent the views or opinions of my employer.