Evolution of the Open Metastore

Jeremiah Hansen

Introduction

The “catalog” is at the forefront of most data platform architecture discussions today. But “catalog” is a heavily overloaded term in the industry, and in my previous blog post, Catalogs from Sears to Iceberg, I tried to provide a framework for distinguishing the different categories of “catalogs”. As promised there, this post will focus on what I believe is the most interesting category, namely the database object catalog (or metastore). Here I will provide a brief overview of open source metastores, with a focus on the most recent iteration, the Iceberg catalog.

The purpose of the database catalog, from the beginning, has been to track metadata about the objects supported by the database and to provide a mechanism for inspecting that metadata. Enterprise databases have always supported a large number of object types to enable users to work with their data most effectively. Tables (of data) are the most important and most foundational objects supported by a database, but there are other important objects such as users, roles, indexes, views, materialized views, stored procedures, UDxFs, triggers, sequences, and sometimes user-defined data types. And the metastore is responsible for capturing the necessary details/metadata about each.

When it comes to tables of data, the metastore captures the core structural/technical metadata such as table attributes, column names/types, and more. Additionally, the metastore often captures statistics about the data to help optimize query performance, such as min/max values, null counts, distinct value counts, row counts, byte sizes, and more. For the remainder of this article we will briefly trace the history and evolution of open source data lake metastores, from the Hive Metastore, to the Spark Catalog, and finally to the Iceberg Catalog.

A few quick notes on terminology:

  • Metastore: For the sake of this article, I use the terms database catalog, database object catalog, object metastore, and metastore all synonymously. And for the remainder of this article I will most often use the term metastore since it’s clear and short.
  • Database: For some, the term “database” refers only to legacy RDBMS or DW systems and is no longer used to describe newer systems. But I see no reason to throw away such a good term. The term “database” simply means a foundational system for organizing and managing data, and that definitely still seems useful. For this article I will use database and data platform synonymously.
  • Data Lake: There are many definitions of the term “data lake” (a term that I’ve always disliked). But the most straightforward definition is a customer-owned, file-based data store that is decoupled from any compute engine. Data lakes began with Hadoop and HDFS and continue today with cloud storage services.

As you read through the deep dives that follow, pay particular attention to the scope of features supported by each metastore. Even though all are called metastores (or catalogs), the scope of each can be very different. What’s most important in discussions about “catalogs” is which objects are supported and which features of each object type are supported. Let’s jump in!

Hive Metastore Deep Dive

The Hive Metastore (or HMS for short) has been the de facto standard open metastore for the file-based data lake world for a while now. Hive was originally developed on Hadoop as a database-like SQL interface for querying data in HDFS. It was created at Facebook in 2007 and became an Apache Software Foundation project in 2010. The HMS has evolved over the years to support a wide range of objects and features that you’d expect to find in an enterprise database.

Many large organizations choose to host their own HMS instances, and most metastore services (usually referred to as “catalogs”), often hosted by the cloud service providers (like AWS Glue), expose an HMS API. Of course many of these services add proprietary features/extensions, but all offer HMS API compatibility (more on that later).

Features

At this point you might be wondering what all the HMS covers. Here is a summary of the objects for which the HMS captures metadata:

  • Databases
  • Tables (managed, external)
  • Partitions (for partitioned tables)
  • Columns
  • Statistics (tables, columns, partitions)
  • Table Constraints (primary keys, foreign keys, etc.)
  • Indexes (now deprecated)
  • Materialized Views
  • Privileges and Roles (authorization-related metadata)
  • SerDe Information (storage format details)
  • User-Defined Functions (UDFs)
  • Data Connectors
  • Stored Procedures
  • Workload Management
  • Replication
  • Table maintenance
  • Transaction control

Schema

To understand the HMS schema, here is a helpful ER diagram for visualizing the structure of the HMS backend store. Please note that this diagram is from 2016, and while some new features have certainly been added since then (notably data connectors, stored procedures, MVs, workload management and replication), the core is still essentially the same.

Diagram from Analytics Anvil’s article Useful queries for the Hive metastore

The Apache Hive source code includes a number of helpful resources for working with the HMS. The first thing it includes is SQL DDL scripts for each of the major RDBMSs that are used to host the HMS backend, including SQL Server, Oracle, MySQL and PostgreSQL. For instance, here is the HMS schema DDL for PostgreSQL from the most recent version, 4.1.0. The second thing it includes is a tool, named schematool, to help migrate the HMS backend to newer versions.¹

API

The Hive Metastore (HMS) also includes an API for accessing the metastore, which is based on Apache Thrift. Thrift is a framework for developing cross-language services and their corresponding APIs. The definition for a service in Thrift is captured in an “Interface Definition Language” (or IDL) file. You can find the most recent version of the HMS Thrift IDL in the Hive source code. While there are currently around 280 methods in the HMS Thrift IDL, here is a sample of some of the methods to give you a flavor of the interface:

  • Database-related methods:
    - get_all_databases: Get a list of all databases.
    - get_database: Get details of a specific database.
    - create_database: Create a new database.
    - drop_database: Drop a database.
  • Table-related methods:
    - get_all_tables: Get a list of all tables in a database.
    - get_table: Get details of a specific table.
    - create_table: Create a new table.
    - drop_table: Drop a table.
    - alter_table: Alter the structure of a table.
  • Partition-related methods:
    - get_partitions: Get a list of partitions for a table.
    - get_partition: Get details of a specific partition.
    - add_partition: Add a new partition to a table.
    - drop_partition: Drop a partition from a table.
  • Column-related methods:
    - get_fields: Get the columns (fields) for a table.
  • Index-related methods:
    - get_index_by_name: Get details of an index by name.
  • Type-related methods:
    - get_type: Get details of a specific type.
    - get_type_all: Get a list of all types.
  • Transaction-related methods:
    - open_txns: Open a new transaction.
    - commit_txn: Commit a transaction.
    - rollback_txn: Rollback a transaction.
  • Other methods:
    - get_functions: Get a list of all functions.
    - get_config_value: Get the value of a Hive configuration property.

Most of the time, however, users interact with the HMS through Spark or Hadoop services such as Hive rather than directly through the Thrift interface.
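That said, the Thrift interface can be called directly. Here is a minimal Python sketch of doing so; it assumes the third-party hmsclient package (a thin wrapper over the generated Thrift client) and a metastore listening on the default port 9083, neither of which is part of Hive itself.

# A minimal sketch of calling the HMS Thrift API directly from Python.
# Assumes the third-party `hmsclient` package and an HMS at localhost:9083.
from hmsclient import hmsclient

client = hmsclient.HMSClient(host="localhost", port=9083)

with client as c:
    # These methods map one-to-one to the Thrift IDL methods listed above
    print(c.get_all_databases())

    for table_name in c.get_all_tables("default"):
        table = c.get_table("default", table_name)
        print(table.tableName, table.sd.location)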

Spark Catalog Deep Dive

The Spark Catalog (or Catalog API) was first introduced in Spark v2.0, which was released in 2016². The Catalog API gave Spark users a standardized way to access object metadata in Spark SQL and removed the direct dependency on the Hive Metastore (HMS), although the HMS has remained the primary backend used by the Spark Catalog to persist object metadata. The second version (v2) of the Spark Catalog was introduced in Spark v3.0, which was released in 2020. This significant release introduced a pluggable catalog model, which allowed for multi-catalog support and external catalog implementations. The Spark Catalog v2 release also added deeper integration with table formats like Iceberg, Delta, and Hudi, among other improvements.

The Spark Catalog API is now the standard way that Spark users interact with the object metastore, and has been integrated into Spark SQL.
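To make that pluggable catalog model concrete, here is a minimal PySpark sketch that registers a second catalog alongside the built-in spark_catalog. The catalog name (“lake”), the REST endpoint, and the choice of Iceberg’s SparkCatalog plugin are illustrative assumptions, and the matching iceberg-spark-runtime package would need to be on the classpath.

from pyspark.sql import SparkSession

# Illustrative sketch: plug an external (Iceberg) catalog named "lake" into
# Spark 3.x via the catalog plugin configuration. The name, URI, and runtime
# package are placeholders/assumptions.
spark = (
    SparkSession.builder
    .appName("pluggable-catalog-demo")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "http://localhost:8181")
    .getOrCreate()
)

# Objects in the plugged-in catalog are addressed with three-part names
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.analytics")
spark.sql("SHOW TABLES IN lake.analytics").show()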

Features

The most important thing to note here is that the Spark Catalog API has a much narrower scope than the Hive Metastore (HMS). The HMS provides support for most features you’d expect from a database system. The Spark Catalog, on the other hand, only provides support for a limited set of objects and features:

  • Catalogs
  • Databases
  • Tables
  • Partitions (for partitioned tables)
  • Columns
  • Views (temporary, not persisted in metastore)
  • User-Defined Functions (UDFs)
  • Caching

Also important to note is that each catalog implementation might support a different set of features. For example, an Iceberg catalog today doesn’t support UDFs. So you need to understand the features supported by the particular catalog implementation you’re using.

Schema

Due to the pluggable catalog model of the Spark Catalog, and Spark’s support for various types of tables and catalogs, there isn’t a single backend schema used to persist the object metadata. As mentioned above, many people use the HMS as the backend metastore, whose schema we covered earlier, but ultimately it is up to each catalog implementation to define how it persists the object metadata. If you’re interested, try asking ChatGPT for a conceptual DDL for the Spark Catalog and it will give you some ideas for a database schema to persist the object metadata.

API

As I mentioned in the Features section above, the scope (feature set) of the Spark Catalog is much narrower than that of the HMS. And, by extension, so is the API. You can view the current PySpark version of the Spark Catalog API for more details, but here is a categorized list of its 27 methods (a short usage example follows the list):

  • Catalog-related methods:
    - Catalog.currentCatalog: Returns the current default catalog in this session.
    - Catalog.listCatalogs: Returns a list of catalogs in this session.
    - Catalog.setCurrentCatalog: Sets the current default catalog in this session.
  • Database-related methods:
    - Catalog.currentDatabase: Returns the current default database in this session.
    - Catalog.databaseExists: Check if the database with the specified name exists.
    - Catalog.getDatabase: Get the database with the specified name.
    - Catalog.listDatabases: Returns a list of databases available across all sessions.
    - Catalog.setCurrentDatabase: Sets the current default database in this session.
  • Table-related methods:
    - Catalog.createExternalTable: Creates a table based on the dataset in a data source.
    - Catalog.createTable: Creates a table based on the dataset in a data source.
    - Catalog.getTable: Get the table or view with the specified name.
    - Catalog.listTables: Returns a list of tables/views in the specified database.
    - Catalog.tableExists: Check if the table or view with the specified name exists.
  • Column-related methods:
    - Catalog.listColumns: Returns a list of columns for the given table/view in the specified database.
  • Partition-related methods:
    - Catalog.recoverPartitions: Recovers all the partitions of the given table and updates the catalog.
  • View-related methods:
    - Catalog.dropGlobalTempView: Drops the global temporary view with the given view name in the catalog.
    - Catalog.dropTempView: Drops the local temporary view with the given view name in the catalog.
  • Function-related methods:
    - Catalog.functionExists: Check if the function with the specified name exists.
    - Catalog.getFunction: Get the function with the specified name.
    - Catalog.listFunctions: Returns a list of functions registered in the specified database.
    - Catalog.registerFunction: An alias for spark.udf.register().
  • Caching-related methods:
    - Catalog.cacheTable: Caches the specified table in-memory or with given storage level.
    - Catalog.clearCache: Removes all cached tables from the in-memory cache.
    - Catalog.isCached: Returns true if the table is currently cached in-memory.
    - Catalog.refreshByPath: Invalidates and refreshes all the cached data (and the associated metadata) for any DataFrame that contains the given data source path.
    - Catalog.refreshTable: Invalidates and refreshes all the cached data and metadata of the given table.
    - Catalog.uncacheTable: Removes the specified table from the in-memory cache.
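Here is a short PySpark sketch that exercises a handful of these methods; the table name is just a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-catalog-api-demo").getOrCreate()

# Catalog- and database-level metadata for this session
print(spark.catalog.currentCatalog())   # requires Spark 3.4+
print(spark.catalog.listDatabases())

# Create a small managed table, then inspect it through the Catalog API
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"]).write.saveAsTable("demo_tbl")
print(spark.catalog.tableExists("demo_tbl"))
for col in spark.catalog.listColumns("demo_tbl"):
    print(col.name, col.dataType, col.isPartition)

# Caching-related methods
spark.catalog.cacheTable("demo_tbl")
print(spark.catalog.isCached("demo_tbl"))
spark.catalog.uncacheTable("demo_tbl")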

Iceberg Catalog Deep Dive

The Iceberg Catalog is probably the simplest database object catalog (or metastore) to understand. Its scope is limited to Iceberg tables and views, and its responsibility is to maintain a list of tables and views along with a pointer to the most recent version of each object. Almost all of the metadata related to an Iceberg table has been moved out of the catalog and into the table format itself. By keeping the catalog very lightweight, and storing the table’s metadata in metadata files alongside the data, Iceberg is able to provide incredible flexibility and portability, making it the table format of choice today.

When it comes to open source Iceberg catalogs, check out Apache Polaris. Polaris is the first open source Iceberg catalog with centralized, cross-engine security and access. And if you’re interested in an easy way to get started with a hosted version of Polaris, Snowflake now offers one, even if you don’t use Snowflake.

Features

In our discussion of the Spark Catalog features above, it was mentioned that the Spark Catalog has a much narrower scope than the Hive Metastore (HMS). Well, the Iceberg Catalog narrows that scope even further. Here are the core objects and features supported by the Iceberg catalog:

  • Namespaces
  • Tables (list, current metadata file)
  • Views
  • Transaction control (atomic swap)
  • Misc (metrics, scan planning)

With Iceberg the catalog is primarily focused on tables of data, and the role of the catalog is simply to keep a list of tables along with the current metadata file (metadata.json) for each. It provides transaction control through a method that atomically swaps the pointer to a table’s current metadata file. Other than tables, the Iceberg catalog also supports views, which were added to the catalog in March 2024 (with Iceberg version 1.5.0).
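Here is a quick sketch of that narrow contract using PyIceberg as one example client; the catalog name, REST URI, and namespace/table names are all placeholders.

# Illustrative sketch using PyIceberg: the catalog's main job is to resolve a
# table name to its current metadata.json pointer.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",                              # catalog name (placeholder)
    **{"uri": "http://localhost:8181"},  # REST catalog endpoint (placeholder)
)

print(catalog.list_namespaces())
print(catalog.list_tables("analytics"))

table = catalog.load_table("analytics.orders")
# The pointer the catalog tracks (and atomically swaps on each commit)
print(table.metadata_location)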

Schema

As with the Spark Catalog, there are many different backend implementations for an Iceberg catalog, which means there is no single relational schema used to persist the catalog metadata. However, if you look in the Iceberg source code for the JDBC Catalog, the JdbcUtil.java class does have code to create the two tables it uses to persist the metadata. This provides us with a representative data model used by an Iceberg catalog:

CREATE TABLE iceberg_tables
(
  catalog_name VARCHAR(255) NOT NULL,
  table_namespace VARCHAR(255) NOT NULL,
  table_name VARCHAR(255) NOT NULL,
  metadata_location VARCHAR(1000),
  previous_metadata_location VARCHAR(1000),
  record_type VARCHAR(5),
  PRIMARY KEY (catalog_name, table_namespace, table_name)
)

CREATE TABLE iceberg_namespace_properties
(
  catalog_name VARCHAR(255) NOT NULL,
  namespace VARCHAR(255) NOT NULL,
  property_key VARCHAR(255),
  property_value VARCHAR(1000),
  PRIMARY KEY (catalog_name, namespace, property_key)
)

The first thing that should stand out here is how few tables there are to track the catalog metadata. And the “iceberg_tables” table supports both Iceberg tables and views, using the “record_type” column to distinguish between them.
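To illustrate how a commit works against this schema, here is a conceptual Python sketch (not the actual JdbcCatalog code) of the compare-and-swap that atomically moves a table’s metadata pointer; SQLite is used only to keep the example self-contained.

# Conceptual sketch of an Iceberg catalog commit as a compare-and-swap UPDATE
# against the iceberg_tables table shown above (not the real JdbcCatalog code).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE iceberg_tables (
         catalog_name TEXT, table_namespace TEXT, table_name TEXT,
         metadata_location TEXT, previous_metadata_location TEXT, record_type TEXT,
         PRIMARY KEY (catalog_name, table_namespace, table_name))"""
)
conn.execute(
    "INSERT INTO iceberg_tables VALUES "
    "('demo', 'analytics', 'orders', 's3://bucket/orders/metadata/v1.metadata.json', NULL, 'TABLE')"
)

def commit_table(catalog, namespace, name, expected_location, new_location):
    cur = conn.execute(
        """UPDATE iceberg_tables
              SET metadata_location = ?, previous_metadata_location = ?
            WHERE catalog_name = ? AND table_namespace = ? AND table_name = ?
              AND metadata_location = ?  -- succeeds only if nobody else committed first""",
        (new_location, expected_location, catalog, namespace, name, expected_location),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise RuntimeError("Concurrent commit: the metadata pointer has already moved")

commit_table("demo", "analytics", "orders",
             "s3://bucket/orders/metadata/v1.metadata.json",
             "s3://bucket/orders/metadata/v2.metadata.json")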

API

There are two main types of APIs for implementing an Iceberg catalog: Java and REST. With the Java SDK, each supported compute service implements its own Iceberg catalog client connector. For the list of provided connectors, check out the left-hand navigation bar on the main Iceberg documentation page³.

But the really exciting API for Iceberg catalogs is the REST API. The Java client API is tightly coupled with the Iceberg project, and each client’s catalog implementation must be defined and stored along with the Iceberg project source code. But with the REST API, the implementation details live on the server side, hidden from the client. This makes it possible for the same client to connect to any service provider’s Iceberg catalog, as long as that catalog implements the REST interface! This is conceptually similar to how the Hive Metastore does things with its Thrift API, but REST is a much more modern and accessible way to build APIs.

The Apache Iceberg REST Catalog API is defined using the OpenAPI specification. You can find the latest Iceberg REST OpenAPI specification in the Iceberg source code⁴. Here is a categorized list of the 16 REST endpoints in the API (a short example of calling a few of them follows the list):

Configuration API

  • /v1/config: List all catalog configuration settings

Catalog API

  • Namespaces
    - /v1/{prefix}/namespaces: List and create namespaces
    - /v1/{prefix}/namespaces/{namespace}: Inspect, update and remove namespaces
    - /v1/{prefix}/namespaces/{namespace}/properties: Set or remove properties on a namespace
  • Tables
    - /v1/{prefix}/namespaces/{namespace}/tables: List and create tables
    - /v1/{prefix}/namespaces/{namespace}/register: Register a table in the given namespace using given metadata file location
    - /v1/{prefix}/namespaces/{namespace}/tables/{table}: Inspect, update and remove tables
    - /v1/{prefix}/tables/rename: Rename a table
  • Transactions
    - /v1/{prefix}/transactions/commit: Commit updates to multiple tables in an atomic operation
  • Views
    - /v1/{prefix}/namespaces/{namespace}/views: List and create views
    - /v1/{prefix}/namespaces/{namespace}/views/{view}: Inspect, update and remove views
    - /v1/{prefix}/views/rename: Rename a view
  • Misc
    - /v1/{prefix}/namespaces/{namespace}/tables/{table}/metrics: Send a metrics report to this endpoint to be processed by the backend
    - /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan: Submit a scan for planning
    - /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}: Fetch the results of or cancel a scan plan
    - /v1/{prefix}/namespaces/{namespace}/tables/{table}/tasks: Fetches result tasks for a plan task
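Here is a minimal sketch of calling a few of those endpoints with plain HTTP. The base URL and the namespace/table names are placeholders, the {prefix} is assumed to be empty, and a real deployment would typically also require an authorization header.

# Minimal sketch of the Iceberg REST catalog API over plain HTTP.
# Assumes a catalog at localhost:8181 with an empty {prefix} and no auth.
import requests

BASE = "http://localhost:8181"

# Catalog configuration (defaults and overrides that clients should apply)
print(requests.get(f"{BASE}/v1/config").json())

# List namespaces, then the tables in one of them
print(requests.get(f"{BASE}/v1/namespaces").json()["namespaces"])
print(requests.get(f"{BASE}/v1/namespaces/analytics/tables").json()["identifiers"])

# Loading a table returns (among other things) the current metadata pointer
load = requests.get(f"{BASE}/v1/namespaces/analytics/tables/orders").json()
print(load["metadata-location"])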

Aside: Table Statistics

It’s interesting to note a recent shift in the evolution of table/column statistics in metastores. For the past 50+ years database systems have tracked table and column statistics for the purpose of optimizing query performance. Beginning with IBM’s System R in the 1970s up to today, table/column statistics are critical for efficient query planning. But where those statistics are stored has changed over the years. Let’s take a quick look at the evolution of table statistics from the HMS, to Parquet, to Iceberg.

First, let’s consider the Hive Metastore (HMS). If you jump back up to the schema for the HMS you’ll find two tables used to store statistics (TAB_COL_STATS and PART_COL_STATS). The schemas are essentially the same for both, so here’s the current DDL for the TAB_COL_STATS table:

CREATE TABLE "TAB_COL_STATS" (
"CS_ID" bigint NOT NULL,
"COLUMN_NAME" character varying(767) DEFAULT NULL::character varying,
"COLUMN_TYPE" character varying(128) DEFAULT NULL::character varying,
"TBL_ID" bigint NOT NULL,
"LONG_LOW_VALUE" bigint,
"LONG_HIGH_VALUE" bigint,
"DOUBLE_LOW_VALUE" double precision,
"DOUBLE_HIGH_VALUE" double precision,
"BIG_DECIMAL_LOW_VALUE" character varying(4000) DEFAULT NULL::character varying,
"BIG_DECIMAL_HIGH_VALUE" character varying(4000) DEFAULT NULL::character varying,
"NUM_NULLS" bigint NOT NULL,
"NUM_DISTINCTS" bigint,
"BIT_VECTOR" bytea,
"AVG_COL_LEN" double precision,
"MAX_COL_LEN" bigint,
"NUM_TRUES" bigint,
"NUM_FALSES" bigint,
"LAST_ANALYZED" bigint NOT NULL,
"ENGINE" character varying(128) NOT NULL,
"HISTOGRAM" bytea
);

As you’ve seen, neither the Spark Catalog nor the Iceberg Catalog directly captures table/column statistics. Both instead rely on the table implementation itself to store and provide access to that metadata. For example, with the Spark Catalog, if the table is a Hive table then the query engine goes directly to the HMS and reads the metadata from the schema shown above. With newer file and table formats, the statistics are now stored outside the catalog, directly in the file or table formats. This is important because storing the statistics directly in the Hive Metastore RDBMS backend often resulted in performance bottlenecks for query planning.

Next, let’s consider Parquet. The Parquet footer stores metadata about the file, row groups, and columns. Included in that metadata are min/max values, null/distinct/total counts, compressed/uncompressed sizes, and more. Here’s a diagram of the structure of a Parquet file which shows this:

Diagram from the Parquet File Format documentation
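These footer statistics are easy to inspect programmatically. Here is a short PyArrow sketch; the file name is a placeholder.

# Read the statistics Parquet keeps in its footer, per row group and column.
# "orders.parquet" is a placeholder file name.
import pyarrow.parquet as pq

md = pq.ParquetFile("orders.parquet").metadata
print(md.num_rows, md.num_row_groups)

rg = md.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    print(
        col.path_in_schema,
        col.total_compressed_size,
        stats.min if stats and stats.has_min_max else None,
        stats.max if stats and stats.has_min_max else None,
        stats.null_count if stats else None,
    )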

Finally, let’s look at how Iceberg stores statistics. While the Iceberg table spec allows for different file formats for the data, Parquet is the most commonly used format. But so that Iceberg can be (somewhat) file format agnostic, it does not depend upon the Parquet statistics. Instead, Iceberg stores similar statistics inside its metadata files. Both the Manifest Lists and Manifest Files contain summary statistics about the data⁵. Both of those files are stored in the Avro format, but here’s a snippet from a JSON representation of one entry in a sample Manifest File (this does not include all the captured statistics):

{
  "status" : 1,
  "snapshot_id" : {
    "long" : 4746546552005205230
  },
  "sequence_number" : null,
  "file_sequence_number" : null,
  "data_file" : {
    "content" : 0,
    "file_path" : "/hadoop_local/namespace1/orders/data/00000-16-4cd5ae90-037c-4c64-97d5-fd31c5667b75-00001.parquet",
    "file_format" : "PARQUET",
    "partition" : {
      "order_ts_hour" : {
        "int" : 466167
      }
    },
    "record_count" : 1,
    "file_size_in_bytes" : 1276,
    "column_sizes" : { … },
    "value_counts" : { … },
    "null_value_counts" : { … },
    "nan_value_counts" : { … },
    "lower_bounds" : { … },
    "upper_bounds" : { … },
    "key_metadata" : null,
    "split_offsets" : { … },
    "equality_ids" : null,
    "sort_order_id" : { … }
  }
}
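You don’t have to parse the Avro manifests by hand to see these statistics. Iceberg’s engine integrations expose them as queryable metadata tables; here is a PySpark sketch that assumes a session configured with an Iceberg catalog named “lake” (as in the earlier pluggable-catalog example), with placeholder namespace and table names.

from pyspark.sql import SparkSession

# Assumes an existing session with an Iceberg catalog named "lake" configured,
# as in the earlier sketch; the namespace and table names are placeholders.
spark = SparkSession.builder.getOrCreate()

# Per-data-file statistics, sourced from the manifest entries shown above
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes,
           null_value_counts, lower_bounds, upper_bounds
    FROM lake.analytics.orders.files
""").show(truncate=False)

# Manifest- and snapshot-level metadata
spark.sql("SELECT * FROM lake.analytics.orders.manifests").show()
spark.sql("SELECT * FROM lake.analytics.orders.snapshots").show()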

To repeat, statistics are still critical for query optimization, but with newer file and table formats, the statistics are now stored outside the catalog in the file or table formats directly.

Conclusion

The database object catalog (or metastore) has been an essential component of databases and data platforms since it was introduced in the 1970s, beginning with IBM’s System R. And the need for metastores hasn’t changed today.

But the scope of features supported by various metastores has changed, and it’s critical to understand which objects and features are supported by each individual catalog. As we’ve seen, the Hive Metastore (or HMS) has evolved to provide support for a wide range of objects important for working with data. But the Spark Catalog and Iceberg Catalog, on the other hand, have a much narrower scope, focusing primarily on tables. We also looked briefly at how table and column statistics are handled by the metastores. The HMS supports those statistics directly by storing them in its backend datastore. But the Spark Catalog and Iceberg Catalog do not directly store that metadata; instead, they depend on the underlying table implementation to store and provide access to it.

One of the key takeaways here is that even though all are called metastores (or “catalogs”), the scope of each can be very different. What’s most important in discussions about object “catalogs” is which objects are supported and which features of each object type are supported.

I think there’s one last question that needs to be answered, and you may have already asked it yourself while reading this. That question is: what about the other objects? As we’ve seen, the scope of database objects supported by modern Spark and Iceberg catalogs has shrunk considerably. From System R in the 1970s until Hadoop emerged in the mid-2000s, databases provided rich, mature data management capabilities. When data lakes were created we lost all of those things. Hive in particular has recovered many of them, using the HMS to persist the metadata, but Spark and Iceberg catalogs deal primarily with tables. So what about the other objects? They’re certainly important to an enterprise data platform. My next blog post will explore exactly that question, so stay tuned!

Footnotes

[1] For more details, please check out the helpful article The Hive MetaStore and Local Development.
[2] For more details on the Catalog API introduced in Spark 2.0, see the v2.0.0 Scala Catalog API docs.
[3] And for additional details on creating a custom connector check out the Java Custom Catalog page.
[4] To view and interact with the API, try the https://editor.swagger.io/ web tool
[5] The Iceberg spec also has provisions for additional statistics to be tracked in Puffin files, which I don’t cover here.

Jeremiah Hansen

I’m currently a Field CTO Principal Architect at Snowflake. Opinions expressed are solely my own and do not represent the views or opinions of my employer.