How We Migrated From dbt Cloud and Scaled Our Data Development

This decision has not only increased productivity but also empowered us to self-serve our own features and address bugs efficiently.

Alejandro Rojas
April 5, 2024

In the realm of modern data analytics, dbt has been a game-changing framework for setting up data projects. Its official website defines it as “a workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices”. GlossGenius is no exception; dbt is our de facto transformation tool. We follow a principle of encapsulating and standardizing 100% of transformations within its framework and leverage those in downstream applications like Looker, Hex and reverse ETLs.

As with any tool, we needed a way to run it, and we started with one of the fastest and easiest options: dbt Cloud, a managed service. It includes multiple useful features like scheduling, collaboration tools (git through the UI), dbt docs hosting and an interactive web IDE.

However, as our team size and use cases scaled, we encountered certain limitations. The platform underwent a pricing model change, which increased our annual costs by a considerable amount. Additionally, we experienced instability with minor services that inundated our alerts channels, and we found ourselves in search of more robust scheduling capabilities.

In this blog post, we will share our journey of migrating away from dbt Cloud – exploring the reasons behind this decision, the alternatives we considered and the lessons we learned along the way.

Problem Statement

Our setup with dbt Cloud was straightforward and aligned with the recommended best practices. The entire data team had developer access within the tool, letting them make changes to any model and trigger ad hoc jobs.

One notable aspect of our setup was the adoption of slimCI, a lighter continuous integration feature in dbt Cloud. The CI process checks for modified models and their downstream dependencies to avoid re-running the entire project every time a developer wants to make changes. Despite this useful feature, we faced growing issues with the platform’s stability, leading to frequent disruptions and alerts flooding our channels.

Moreover, our job scheduling relied heavily on a CRON-based system, which offered limited flexibility and scalability. These drawbacks became more apparent as our data operations grew: it was hard to link multiple schedules based on execution marks, or to rerun failed jobs without re-executing the entire project. These limitations highlighted the need for a more adaptable scheduling solution.

We still found workarounds for the CRON limitations, like creating an ad hoc job that any developer could trigger at any time after a merge. This approach allowed models to be updated immediately after code changes or rerun after a failure. While we explored additional features like SQLFluff linting, we did not fully utilize them in our dbt Cloud environment.

One of the primary drivers behind exploring alternatives and moving off dbt Cloud was the announcement of its pricing model change, from a seat-based model to a consumption-based model. This change meant a significant departure from our existing contract, prompting us to reevaluate our options and consider alternatives that could offer a more cost-effective and scalable solution.

Evaluation of Alternatives

We evaluated multiple alternatives to determine the best-fit solution for our data and analytics needs.

In our evaluation process, we focused on the following key criteria to assess each option:

  • Ease of use: We wanted to keep the ease of use that dbt Cloud offered to our developers.
  • Ease of setup: Making a smooth transition from one tool to another was crucial to keep the data team’s speed in new developments and insights delivery.
  • Feature parity: We focused on maintaining key features like slimCI, dbt docs hosting and job scheduling.
  • Pricing: The new solution had to be cost-effective, i.e., cost should not increase linearly as the team and models scaled.

After evaluations, we decided to migrate to dbt-core and schedule jobs using an orchestration tool such as Airflow.

Embracing Apache Airflow for Orchestration

We opted for Apache Airflow as our orchestration tool, a widely used solution in modern data stacks. Since we already used dbt to organize our data models and pipelines, we adopted Cosmos, a dbt parser for Airflow. This framework streamlined execution by treating each dbt model and its tests as individual Airflow tasks, which makes it easy to rerun from the point of failure and to use Airflow’s native custom callback functions.
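To make the "one task per model" idea concrete, here is a minimal sketch using only the Python standard library. It does not use the Cosmos or Airflow APIs; the node names and the trimmed-down manifest structure are hypothetical, and the point is only to show why a per-node graph lets you rerun a failed model and its downstream nodes instead of the whole project.

```python
from graphlib import TopologicalSorter

# Hypothetical, heavily trimmed stand-in for the "nodes" section of a dbt
# manifest: each node maps to the list of nodes it depends on.
MANIFEST_NODES = {
    "model.stg_orders": [],
    "test.stg_orders_not_null": ["model.stg_orders"],
    "model.fct_orders": ["model.stg_orders"],
    "test.fct_orders_unique": ["model.fct_orders"],
}


def task_order(nodes):
    """Return one valid execution order, with one task per dbt node."""
    return list(TopologicalSorter(nodes).static_order())


def tasks_to_rerun(nodes, failed):
    """Return the failed node plus every node downstream of it."""
    to_rerun = {failed}
    changed = True
    while changed:
        changed = False
        for node, parents in nodes.items():
            if node not in to_rerun and to_rerun.intersection(parents):
                to_rerun.add(node)
                changed = True
    return to_rerun
```

If `model.fct_orders` fails, `tasks_to_rerun(MANIFEST_NODES, "model.fct_orders")` selects only that model and its test, leaving the already successful staging model untouched.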

Adopting a comprehensive orchestrator like Airflow enhanced our data stack’s monitoring capabilities, letting us customize our Slack notifications with additional information like the model that failed or the specific error that was raised. Centralizing job execution within this tool also eliminated the need for multiple scheduling platforms, allowing us to create multiple dbt DAGs, manage various data ingestions, upload files for our self-hosted dbt docs and implement weekly cleanup processes.
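A failure callback along these lines could look like the sketch below. It is an illustration rather than our production code: the webhook URL, the message layout and the example DAG and task names are placeholders. It relies only on the context dictionary that Airflow passes to an `on_failure_callback`, which includes the task instance and the raised exception.

```python
import json
import urllib.request
from types import SimpleNamespace

# Hypothetical placeholder; a real webhook URL would come from a secret store.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"


def build_failure_message(context):
    """Format a Slack message from an Airflow task-failure context."""
    ti = context["task_instance"]
    error = context.get("exception")
    return (
        ":red_circle: dbt task failed\n"
        f"*DAG*: {ti.dag_id}\n"
        f"*Task*: {ti.task_id}\n"
        f"*Error*: {error}"
    )


def notify_slack_on_failure(context):
    """Candidate for a task's on_failure_callback: post the message to Slack."""
    payload = json.dumps({"text": build_failure_message(context)}).encode()
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


# Stand-in for the context Airflow provides at runtime (names are made up).
example_context = {
    "task_instance": SimpleNamespace(dag_id="dbt_daily", task_id="fct_orders_run"),
    "exception": "Database Error: relation does not exist",
}
message = build_failure_message(example_context)
```

Keeping the message formatting separate from the HTTP call makes the formatting easy to check locally without sending anything to Slack.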

Hosting dbt Docs Without dbt Cloud

Ensuring access to dbt documentation for all users became a priority for us, as dbt docs helped downstream consumers discover and understand the datasets we curated within dbt. This accessibility was a challenge with dbt Cloud, where accessing the docs required a seat in our account. To address this, we opted to host docs in an S3 bucket. 

We implemented a solution that involved creating a Content Delivery Network (CDN) connected to our Single Sign-On (SSO) provider. Upon successful authentication, users are directed to a static website hosting the dbt docs. This approach ensures seamless access to the entire company without the need for individual account seats, thereby reducing costs while maintaining security and accessibility.
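As a rough sketch of the upload side, the snippet below collects the three files that `dbt docs generate` writes to the `target/` directory and that the static site needs, then uploads them with boto3. The bucket name, key prefix and function names are hypothetical, and the CDN and SSO configuration are out of scope here.

```python
from pathlib import Path

# Files written to target/ by `dbt docs generate` that the static site needs.
DOCS_FILES = {
    "index.html": "text/html",
    "manifest.json": "application/json",
    "catalog.json": "application/json",
}


def plan_docs_upload(target_dir, prefix=""):
    """Return (local_path, s3_key, content_type) for each docs file."""
    plan = []
    for name, content_type in DOCS_FILES.items():
        path = Path(target_dir) / name
        if not path.is_file():
            raise FileNotFoundError(f"missing {path}; run `dbt docs generate` first")
        plan.append((path, f"{prefix}{name}", content_type))
    return plan


def upload_docs(target_dir, bucket):
    """Upload the docs files with boto3 (imported lazily; needs AWS credentials)."""
    import boto3

    s3 = boto3.client("s3")
    for path, key, content_type in plan_docs_upload(target_dir):
        # Setting ContentType lets browsers render index.html instead of downloading it.
        s3.upload_file(str(path), bucket, key, ExtraArgs={"ContentType": content_type})
```

A scheduled task could call `upload_docs("target", "example-docs-bucket")` after each production run so the hosted docs stay in sync with the deployed models.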

Custom SlimCI with GitHub Actions

To migrate off dbt Cloud’s slimCI, we had to find a way to replicate its functionality with our existing CI/CD tool: GitHub Actions. To enable slimCI, the following are required:

  • A project manifest file
  • A server / instance to execute dbt commands
  • Access to our data warehouse

We followed these steps in our CI workflow:

  • Download the manifest from S3
  • Set up the Python environment using Poetry
  • Execute dbt using the state:modified+ selector
  • After closing the PR: Delete the temporary schema
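The steps above can be sketched as a small Python helper that assembles the commands a CI job would run. Several details are assumptions made for illustration: the per-PR schema naming, the `DBT_CI_SCHEMA` environment variable (which `profiles.yml` would need to read), the choice of `dbt build`, and the `drop_schema` macro, which is a hypothetical project macro rather than something dbt ships. The `state:modified+` selector and the `--state` flag are dbt's own.

```python
def slim_ci_commands(pr_number, manifest_uri, state_dir="prod_state"):
    """Return (env, run_commands, cleanup_command) for one pull request."""
    schema = f"dbt_ci_pr_{pr_number}"  # hypothetical naming convention
    # Assumes profiles.yml reads the target schema from this (hypothetical) variable.
    env = {"DBT_CI_SCHEMA": schema}
    run_commands = [
        # 1. Fetch the latest production manifest to compare against.
        ["aws", "s3", "cp", manifest_uri, f"{state_dir}/manifest.json"],
        # 2. Install the pinned Python environment.
        ["poetry", "install"],
        # 3. Build only modified models and everything downstream of them.
        ["poetry", "run", "dbt", "build",
         "--select", "state:modified+", "--state", state_dir],
    ]
    # 4. Run when the PR closes; `drop_schema` is a hypothetical project macro.
    cleanup_command = [
        "poetry", "run", "dbt", "run-operation", "drop_schema",
        "--args", f"{{schema_name: {schema}}}",
    ]
    return env, run_commands, cleanup_command
```

Each command list can be passed to `subprocess.run(cmd, check=True, env={**os.environ, **env})`, or translated one-to-one into workflow steps in GitHub Actions.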

Our prior preparations made the emulation of slimCI fast and easy, and creating our own jobs gave us additional flexibility. This approach ensures that slimCI always runs against the latest version of the manifest, instead of the one from the morning’s execution. Now we avoid running the same model multiple times when developers merge numerous changes during the day.

Planning and Executing the Migration

For feature parity with dbt Cloud, we needed to decide which tools the entire team would use alongside dbt-core:

  • As an IDE we chose Visual Studio Code, which is one of the most popular IDEs at the moment and has a lot of useful dbt-related extensions that mimic some of dbt Cloud’s features.
  • Since dbt-core runs on Python, we ran into one of the most frequent issues for Python developers: keeping local development environments consistent. For this, we chose Poetry to package our project environment, along with pyenv on each developer’s local setup.
  • For version control, developers use GitHub from Visual Studio Code’s UI or directly from the terminal, depending on their preference.
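One lightweight way to catch inconsistent local environments is a small check script that developers can run after setup. The sketch below is an illustration and not part of our tooling; the function name and the expected version passed to it are assumptions.

```python
import sys
from importlib import metadata


def check_environment(expected_python, required_packages):
    """Return a list of human-readable problems; empty means the setup matches."""
    problems = []
    if sys.version_info[:2] != tuple(expected_python):
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, expected_python))
        problems.append(f"Python {found} found, expected {wanted}")
    for package in required_packages:
        try:
            metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed; try `poetry install`")
    return problems
```

For example, `check_environment((3, 11), ["dbt-core"])` returns an empty list only when the interpreter is Python 3.11 and dbt-core is installed in the active environment.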

We set up a number of documentation resources and how-to guides in the company’s Notion page, so the data team could ramp up on the new tooling and processes:

  • Setting up your dbt Python environment using Poetry.
  • Setting up VS Code with all of the data team’s suggested extensions.
  • Setting up git to work in your terminal.

We held weekly office hours led by the data engineering team and created a dedicated Slack channel to troubleshoot any user-specific issues. Our documentation resources were regularly updated with the errors team members encountered, along with their resolutions. One of our biggest challenges was handling ARM chips, which sometimes surfaced unusual errors that differed considerably between M1 and M2 chips.

We aimed for a smooth transition because we didn’t want to affect the data developers’ development speed. It took around one month for 90% of the team to effectively move off dbt Cloud, and after a few more weeks we were confident enough to remove all developer seats from the platform.

We kept the dbt Cloud scheduler until we had chosen and deployed our orchestration tool, to ensure a smooth transition and minimal disruption to our workflows. Below is our new architecture with dbt-core:

[Figure: new architecture with dbt-core]

Post-Migration Experience

It’s been almost a year since we moved off dbt Cloud. The overall team feeling is positive, and we have even been able to increase our productivity with the help of more extensions such as Jinja syntax highlighting, easier git file exploration, SQLFluff and YAML linters. Airflow deserves another mention too: it constantly helps us glue together the data team’s services and processes, feeding ingestions, transformations and even data exports to other services.

We have also added other useful features to our development workflow, such as the pre-commit checks described in Montreal Analytics’ post. We know dbt Cloud is constantly evolving and releasing new features, but as a data team we are always looking for ways to improve our own data stack.

In conclusion, the transition to dbt-core has been a journey of continuous learning and improvement for our data team. While exploring alternatives and migrating away from dbt Cloud presented its challenges, we have found value in incorporating elements of the open-source stack into our data platform. This decision has not only increased productivity but also empowered us to self-serve our own features and address bugs efficiently. Moreover, we prioritized scalability and long-term decision-making, recognizing that the short-term convenience of dbt Cloud had to be balanced against the need for a flexible and sustainable data analytics infrastructure. As we continue on this path, we embrace the flexibility, reliability and faster iteration that come with owning the infrastructure.

A Note From Our Team

From Katie Bauer, Head of Data at GlossGenius

When I joined GlossGenius in the summer of 2022, one of the things that excited me most was getting to build a best-in-class Data team. I’ll grant that phrases like “best-in-class” can be a bit nebulous, so I’ll be more specific – I wanted to build a team that cared about driving results. As a data professional I believe in the value of quantifying your business and product, but I am adamant that that value can’t be fully realized unless it’s in service of your company’s goals and long-term success.

Of course, that’s easier said than done. Data work is a combination of a lot of different things – technology, process, and even culture building. We’ve invested in all three over the past two years, and I’m incredibly proud of all the ways we’ve grown and matured in that time, as well as of the great many things that we’ve accomplished. Our team’s work has played a key role in finding opportunities and chasing them down – GlossGenius is a data-loving company, and it wouldn’t look the same without us being here.

This blog is meant to showcase how we’ve done that. Stay tuned – there’s lots of good stuff coming!

