Draining the Data Lake

(Photo: Lake Berryessa in California, by Doug Letterman/Flickr)

Introduction

In this blog post I will try to make the case that we should both move beyond the file-based data lake and stop using the terms “data lake” and “data lakehouse” altogether!

First, can we please stop using the term “data lake”?

I’ve never liked the term “data lake” and think it’s time we retire it altogether. The term has taken on a negative connotation because it introduces an unnatural, artificial boundary into the enterprise data landscape, and beyond that, the term itself has always bugged me.

The original purpose for data lakes

Data lakes were created to overcome real limitations in legacy data warehouses. The data landscape changed significantly with the advent of “big data,” and those changes are often grouped into three categories, the “3 Vs” of big data. Here is a brief summary of the changes:

  • Volume: Need to store and process very large amounts of data cost-effectively. But legacy systems struggle to scale storage and compute economically at that size.
  • Variety: Need to store data in many different native formats (structured, semi-structured, unstructured), and to allow for schema-on-read semantics (see the sketch after this list). But legacy systems either can’t store semi-structured data natively or can’t query it performantly, and it’s costly and time-consuming to structure the data on ingest.
  • Velocity: Need to ingest both batch and streaming data. But legacy systems can’t handle streaming (near real-time) data effectively, especially given that it is often in a semi-structured format.
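
To make the Variety point concrete, here is a minimal schema-on-read sketch in Python using pandas. The file name and field names are hypothetical; the idea is that raw semi-structured events are landed as-is, and structure is applied only when the data is read for a particular analysis.

```python
# Schema-on-read sketch: raw events are stored untouched; structure is
# decided at query time, not at ingest time.
import json

import pandas as pd

# Hypothetical newline-delimited JSON file of raw clickstream events.
raw_events = [json.loads(line) for line in open("events.jsonl")]

# Flatten the nested JSON and project only the fields this analysis needs.
# A different consumer can read the same raw file and pick a different shape.
df = pd.json_normalize(raw_events)[["user_id", "event_type", "payload.url"]]

# Schema-on-write, by contrast, would force every event into a fixed table
# layout at load time; fields not modeled up front are lost or rejected.
print(df.head())
```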

The problem with data lakes

The primary problem with data lakes is that they introduced a disparate data silo into the data landscape (see previous diagram). This has resulted in the following major challenges:

  • Very complex to manage and use (multiple tools, languages, file formats, etc.)
  • Proliferates data siloing across the organization (difficult to integrate raw data with modeled data in the DW/data marts)
  • Slow performance analyzing data (data often has to be converted to different file formats for acceptable performance)
  • Separate platform/compute environment to support (with multiple programming/query languages)
  • Difficult to administer and govern securely (hard to enforce uniform access control across tools and security at the table/row level)
      ◦ Enforcing file security at the table or row level is especially hard when a single table is made up of many different files
  • Entirely dependent on naming conventions (that are different everywhere)
  • Slow performance querying files (especially plain text; multiple file formats are needed depending on the analytical use case)
      ◦ Poor performance joining many different data sets made up of many different files
      ◦ Different types of files (full, incremental, changes)
      ◦ Different file formats (uncompressed/compressed; plain text/Parquet/Avro/ORC)
      ◦ Many different CSV formats and file encodings (tool dependent, not obvious, always a challenge; see the sketch after this list)
      ◦ Different types of XML/JSON files (one object per file vs. many objects per file)
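
As a small illustration of those last few points, here is a hedged Python (pandas) sketch of what just reading “the same” data can look like across file formats. The file names, delimiters, and column names are hypothetical; the point is that every text source needs its own dialect handling, while a columnar format like Parquet carries its schema and supports column pruning.

```python
# Hypothetical extracts of the same orders table, delivered three ways.
import pandas as pd

# Vendor A: comma-delimited, UTF-8, with a header row.
orders_a = pd.read_csv("orders_vendor_a.csv", encoding="utf-8")

# Vendor B: pipe-delimited, Latin-1, no header; the schema lives only in a
# naming convention and tribal knowledge.
orders_b = pd.read_csv(
    "orders_vendor_b.csv",
    sep="|",
    encoding="latin-1",
    header=None,
    names=["order_id", "customer_id", "amount", "order_date"],
)

# Columnar Parquet copy: typed, compressed, and column-prunable, so an
# analytical scan reads only what it needs instead of re-parsing text.
orders_parquet = pd.read_parquet("orders.parquet", columns=["order_id", "amount"])

print(len(orders_a), len(orders_b), len(orders_parquet))
```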

Draining the data lake

So what should we replace file-based data lakes with? In a previous blog post, Beyond “Modern” Data Architecture, I discussed the need to move beyond the current systems-based way of thinking. With Snowflake, the need to maintain separate file-based data lakes (and data marts) is eliminated. Instead of thinking in terms of systems, we need to start thinking about data as it actually is to the business, in terms of data zones (or groups):
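
As a hedged sketch of what that can look like in practice, assume (purely for illustration) a raw zone holding semi-structured events in a VARIANT column and a modeled zone holding a conventional dimension table, both in the same Snowflake account. The schema, table, and column names below are hypothetical:

```python
# Sketch only: raw semi-structured data and modeled data queried side by side
# on one platform, instead of keeping the raw data in a separate file-based lake.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder credentials
    user="my_user",
    password="my_password",
    warehouse="analytics_wh",
    database="analytics_db",
)

cur = conn.cursor()
# e.v is a VARIANT column of raw JSON events; the join uses the same SQL,
# engine, and access controls as the modeled dimension table.
cur.execute("""
    SELECT c.customer_name,
           e.v:page_url::string    AS page_url,
           e.v:event_ts::timestamp AS event_ts
    FROM raw_zone.clickstream_events e
    JOIN modeled_zone.dim_customer c
      ON c.customer_id = e.v:customer_id::number
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```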

I’m currently a Senior Sales Engineer at Snowflake. Opinions expressed are solely my own and do not represent the views or opinions of my employer.