A New Approach to Database Change Management with Snowflake
Borrowing an expression from Mugatu in Zoolander, “DevOps is so hot right now”! And the database DevOps space in particular is red hot. As I mentioned in my previous article, a database change management (DCM) tool is critical for managing your Snowflake objects and implementing CI/CD pipelines. But DCM tools have been around for a long time now and aren’t well suited for managing certain types of objects in your database. This blog post will discuss those challenges and present a new way to think about managing your Snowflake objects.
This blog post is the third in a series I’m doing on DevOps with Snowflake. To recap, the first two blog posts in the series are:
- Embracing Agile Software Delivery and DevOps with Snowflake: Discusses the benefits of and unique challenges with Database DevOps, introduces the approaches to Database Change Management, and briefly touches on using cloning and time-travel for DevOps.
- snowchange, a Database Change Management tool: Introduces snowchange, a lightweight, open source database change management tool for Snowflake, and discusses how to get started using it.
Approaches to managing Snowflake objects
As I discussed in part 1 of this series, there are two primary approaches for managing your database objects with a Database Change Management (DCM) tool: declarative and imperative. While both approaches have their own pros and cons, neither of them is particularly well suited for managing all of your database objects. For example, consider database permissions: you can store the individual GRANT statements in a series of scripts, but that is difficult to manage correctly with either approach. Or users and roles, which are better managed from your enterprise identity provider (IdP).
The challenge here is not due to a difference between the declarative or imperative approach, but rather due to the nature of DCM tools. To make this point clearer, let’s start by looking at the list of Snowflake objects which must be managed during the development lifecycle and ongoing operations of your Snowflake account:
The traditional approach has been to try and manage all of your database objects with a single DCM tool (like snowchange), which can be visualized like this:
But as I mentioned above, some objects don’t fit well into this model (users, roles, and permissions, for example). So instead of trying to shoehorn every Snowflake object into a single tool and process, I think we should consider a new approach, one which recognizes the differences between objects and groups of objects. How about something like this instead?
Let’s now look at each group in more detail.
Managing Users and Roles
Trying to manage users and roles with a traditional database change management (DCM) tool is troublesome. The ideal place to manage them is your existing enterprise Identity Provider (IdP), something like Active Directory or Okta. Maintaining two or more lists of users and roles is error prone, and trying to leverage a DCM tool to maintain such a list in code is a pain.
Fortunately, Snowflake supports SCIM (System for Cross-domain Identity Management), which is “an open specification to help facilitate the automated management of user identities and groups (i.e. roles) in cloud applications using RESTful APIs.” In particular, Snowflake provides SCIM support for both Azure Active Directory and Okta. Leveraging SCIM, you can centrally manage your users and roles in your enterprise IdP and have them automatically kept in sync with Snowflake.
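To make this concrete, here is a minimal sketch of what wiring up Okta SCIM provisioning looks like on the Snowflake side. The role and integration names are illustrative, and you should confirm the exact steps against the Snowflake documentation for your IdP:

```sql
-- Create a role for the SCIM client to run as (name is illustrative)
use role accountadmin;
create role if not exists okta_provisioner;
grant create user on account to role okta_provisioner;
grant create role on account to role okta_provisioner;
grant role okta_provisioner to role accountadmin;

-- Security integration that lets Okta provision users and roles via SCIM
create security integration okta_scim
  type = scim
  scim_client = 'okta'
  run_as_role = 'OKTA_PROVISIONER';

-- Generate the SCIM access token to paste into the Okta configuration
select system$generate_scim_access_token('OKTA_SCIM');
```

Once the integration is configured in Okta, users and groups assigned to the Snowflake application in Okta show up as users and roles in your account automatically.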
Managing Privileges
As with users and roles, trying to manage privileges with a traditional DCM tool is troublesome and error prone. This is especially true for imperative style DCM tools, even with features like repeatable migrations. The best way to manage these privileges is by storing the definitions in a central spot and having a tool/process to deploy them. I’ve seen this done a few different ways:
- Managing the privilege definitions in code and using an open source tool to deploy them. See snowflake-rbgm (Rules Based Grant Management) by James Weakley or permifrost by the GitLab Data Team for two similar approaches.
- Managing the permissions in Excel, exporting that data to a Snowflake table, and then using a custom stored procedure to deploy the privileges.
Both tools mentioned in the first option above provide a declarative approach to managing privileges, and both will even revoke privileges not defined in code, ensuring that your Snowflake account always matches what’s defined in code. If you can’t be sure that your security rules are deployed consistently and correctly, how can you be sure that your system is secure? A tool like one of these gives you that assurance.
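The kind of grants such a tool emits can also be kept declarative at the Snowflake level by combining ordinary grants with future grants, so that new objects automatically inherit the right privileges. A hypothetical sketch (all database, schema, and role names are made up for illustration):

```sql
-- Grant a functional role access to an existing database and schema
grant usage on database analytics to role analyst;
grant usage on schema analytics.marts to role analyst;
grant select on all tables in schema analytics.marts to role analyst;

-- Future grants: any table created later in this schema is
-- automatically readable by the role, keeping the rules declarative
grant select on future tables in schema analytics.marts to role analyst;
```

A tool like permifrost or snowflake-rbgm generates and applies statements like these from a central definition file, and revokes any grant that isn’t declared there.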
Managing the Remaining Database Objects
The remaining database objects in the diagram above could then be managed by a traditional DCM tool like snowchange. See my previous blog post snowchange, a Database Change Management tool for an overview of how these tools work.
But Wait, There’s More!
At this point you might be asking yourself how novel this approach really is, given that so far we’ve only removed the need to manage 3 types of objects (users, roles, grants) out of 18 from our traditional DCM tools. Is this worth the hassle? I would argue that even for just those 3 types of objects it’s well worth the effort, for the reasons described above. But we’re not done yet, because I’ve got one more big idea for you to consider!
What if by using replication tools and dbt (data build tool) we could eliminate the need to manage most of the objects we’ve been traditionally using DCM tools for? Consider this breakdown:
What are “Dataset” Objects?
So what are dataset objects? Dataset objects are any database objects which contain or represent sets/tables of data, such as tables, external tables, views, and streams in Snowflake. Typically these represent the bulk of the database objects that you’re maintaining in code, often along with the stored procedure code to move data between these objects (the “plumbing” described below).
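As a quick illustration, each of the following is a dataset object in this sense, because each one holds or exposes a set of rows (the object names here are hypothetical):

```sql
-- A table physically stores a set of rows
create table raw.orders (
  order_id    number,
  customer_id number,
  amount      number(10,2),
  updated_at  timestamp_ntz
);

-- A view is a dataset object too: it represents a derived set of rows
create view marts.large_orders as
  select * from raw.orders where amount > 1000;

-- A stream tracks the changes (inserts/updates/deletes) on a table,
-- exposing them as a queryable set of rows
create stream raw.orders_changes on table raw.orders;
```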
The Challenges with Managing Dataset Objects
In my experience, something like 70–80% of data engineering time is spent managing these dataset objects and the plumbing that goes along with them. By plumbing code I mean the DML statements like INSERT, UPDATE, DELETE, and MERGE which are used to move data between tables, either indirectly through ETL/ELT tools or directly with SQL in a stored procedure. With the traditional DCM tool approach you’re responsible for creating each of the dataset objects and the plumbing code, and that requires a huge amount of development and operations effort.
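A typical piece of this hand-written plumbing is the upsert: merging changed rows from a staging table into a target table. A sketch of what data engineers end up writing, over and over, for every table (all table and column names are illustrative):

```sql
-- Hand-written "plumbing": upsert changed rows from staging into the target
merge into raw.orders as tgt
using staging.orders as src
  on tgt.order_id = src.order_id
when matched and src.updated_at > tgt.updated_at then
  update set
    tgt.customer_id = src.customer_id,
    tgt.amount      = src.amount,
    tgt.updated_at  = src.updated_at
when not matched then
  insert (order_id, customer_id, amount, updated_at)
  values (src.order_id, src.customer_id, src.amount, src.updated_at);
```

Multiply this by every table in your warehouse, plus the DCM scripts to create each object, and the 70–80% figure starts to look plausible.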
Previously, to make life easier, we would often create a metadata-driven process that would dynamically generate the dataset objects and plumbing code and then leverage a DCM tool to help deploy the code. That approach can work but also requires a lot of development effort to create and maintain. What’s more, that approach becomes much more complicated with an imperative-style DCM tool (see part 1 of this series for a definition of the imperative approach).
A New Approach for Managing Dataset Objects
By using data replication tools and dbt (data build tool) we can eliminate the need to manage most of the objects we’ve been traditionally using DCM tools for. By letting these tools do the tedious heavy lifting, data engineers are freed up to focus on the activities which really drive business value such as analyzing and modeling the data. To understand where these two sets of tools will be used we need to take a quick step back and think about the high level data engineering task. I find it helpful to divide the data engineering task into three distinct phases: data ingestion, data transformation, and data consumption. This can be seen visually like this:
The goal for data ingestion is to get a 1:1 copy of the source into Snowflake as quickly as possible. For this phase we’ll use data replication tools. The goal for data transformation is to cleanse, integrate and model the data for consumption. For this phase we’ll use dbt. And we’ll ignore the data consumption phase for this discussion.
Data Replication Tools for Ingestion
Data replication tools are designed to move data from one place to another while maintaining the same structure and values (1:1 copy of the source). They handle creating the target dataset object (table), managing full load vs incremental loads, performing the upsert operation for data that changes, dealing with schema changes, and monitoring the process. A few of them can even read source data from database log files directly so that they don’t have to query the source database and negatively impact performance. These tools save you an incredible amount of engineering time and allow you to focus on higher value tasks than simply moving data as is between two points.
There are many replication tools that work with Snowflake and I’ll name just a few. For tools that replicate data from relational database sources and read directly from log files consider Qlik Replicate (formerly Attunity Replicate), HVR or Stitch. For tools that support relational databases by querying them directly and API sources consider Fivetran, Stitch or Matillion Data Loader. For tools that replicate streaming data consider the Snowflake Connector for Kafka. All of these tools will handle the heavy lifting for you and remove the need to manage the dataset objects and plumbing code yourself.
dbt (data build tool) for Transformation
Data transformation tools are designed to help you cleanse, integrate and model data for consumption. Almost any ETL/ELT tool on the market can be used for this purpose, but they will all require you to manage the dataset objects yourself, and most will also require you to manage the plumbing code yourself with stored procedures or direct SQL queries. As we discussed in the challenges section above, this is a costly and time-consuming activity.
dbt is the first tool I’ve seen to completely turn the data transformation process on its head. Instead of spending most of your time managing dataset objects and the associated plumbing code, dbt allows you to focus on the value-added data transformation activities. For a good overview of dbt and how to use it with Snowflake check out Using DBT to Execute ELT Pipelines in Snowflake by the Hashmap team (and see also What is dbt?). The key thing to understand is that the dbt materializations take care of managing the dataset objects and plumbing code for you. Additionally, it’s easy to create custom materializations that meet your specific needs. I plan to write a blog post on this exact topic soon.
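To see the contrast with the hand-written plumbing above, here is a hypothetical dbt model file. It contains only a SELECT; the incremental materialization generates all of the CREATE TABLE and MERGE logic for you (model and column names are made up for illustration):

```sql
-- models/marts/orders_enriched.sql (hypothetical dbt model)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    o.order_id,
    o.customer_id,
    c.region,
    o.amount,
    o.updated_at
from {{ ref('stg_orders') }} o
join {{ ref('stg_customers') }} c using (customer_id)

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than the target's high-water mark
  where o.updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

Running `dbt run` builds the table on the first run and performs the merge on subsequent runs. No DDL or DML lives in your repository, only the transformation logic itself.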
Let’s quickly review where we’re at with managing our Snowflake objects:
A database change management (DCM) tool is still critical for managing many of our Snowflake objects and implementing CI/CD pipelines. But DCM tools have been around for a long time now and as we discussed aren’t well suited for managing all types of database objects.
By using replication tools and dbt (data build tool) we can eliminate the need to manage most of the objects we’ve been traditionally using DCM tools for. With 70–80% of the code for dataset objects no longer created by hand (or through code generation processes) and no longer being managed in a traditional DCM tool, data engineers are freed up to focus on the activities which create the most business value. This translates into cost savings as well as delivering more value to the business faster. It’s time to start thinking differently about how we manage our Snowflake objects!