What is DBT — A Beginner’s Guide

Sandun Dayananda · Apr 19, 2024 · 10 min read

A simple yet complete introduction to DBT

Image source: getdbt.com

Introduction

In today’s world of data engineering, there’s a tool called Data Build Tool (DBT) that’s making a big splash. It’s like a superhero for data analysts and engineers, helping them work magic with their data. But what exactly is DBT, and how does it change the way we handle data? Let’s find out!

What is DBT?

DBT, or Data Build Tool, is like a special toolbox for transforming data. It’s free for anyone to use and makes it easy to change and organize data in your database. With DBT, you can tell your database what to do with your data using SQL (and, these days, Python as well). It’s like giving your database (a DBMS, a cloud data warehouse, etc.) a recipe to cook up exactly the data you need. So, DBT is used to prepare the recipe, which is then handed over to the database to cook.

Why DBT?

DBT is awesome because it brings the best practices of building software to the world of data. It helps keep things organized and makes sure your data is of good quality. Also, keep in mind that DBT is best suited for batch processing.

Image source: getdbt.com

DBT isn’t your average data tool; it’s like having a data superhero on your team. Here’s why it’s so awesome:

1. Smarter Work, Less Hassle: DBT makes data work feel like a breeze. It’s all about applying the smart techniques used in software development to your data tasks. This means your data stays organised, reliable, and easy to manage.

2. Team Collaboration Made Easy: With DBT, everyone on your team can join forces seamlessly, because CI/CD can be implemented easily with DBT. Its user-friendly features make it simple for data analysts, engineers, and even non-tech folks to work together smoothly. Say goodbye to misunderstandings and hello to efficient teamwork!

3. Works Anywhere, Anytime: Whether you’re using BigQuery, Snowflake, or Redshift, DBT plays nice with all major data platforms. No more compatibility headaches; DBT seamlessly integrates with your setup so you can focus on what matters most — getting insights from your data.

Image source: getdbt.com

4. Data Quality You Can Trust: DBT doesn’t just transform data; it ensures it’s top-notch too. With built-in testing tools, you can rest easy knowing your data is accurate and reliable. It’s like having a built-in data watchdog, sniffing out errors before they become a problem.

Core Concepts of DBT

Models: Think of models as tiny helpers. When we create a model, it can be either a table or a view. Each one has a specific job to do, like sorting out a particular type of data. By breaking down big tasks into smaller ones, DBT makes it easier to manage and understand what’s happening with your data.
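For example, here is a minimal sketch of what a model file might look like (the file name, the raw_customers table, and the columns are all hypothetical):

-- models/active_customers.sql (hypothetical example)
-- A model is just a SELECT statement; DBT turns the result into a table or view.
SELECT
    customer_id,
    email,
    signup_date
FROM raw_customers
WHERE is_active = TRUE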

DAG (Directed Acyclic Graph): This sounds complicated, but it’s just a fancy way of saying DBT helps you see how different tasks (tables or views) are connected. It’s like drawing lines between puzzle pieces to show how they fit together. With DBT, you can see the big picture of how your data is moving and changing.

Image source: getdbt.com

DBT uses the ref() function to build a directed acyclic graph (DAG) of computation internally. This allows DBT to understand the dependencies between models and work out the execution order. So, when you run all your DBT models using dbt run, each model is executed in the proper order, respecting all the dependencies. You will see something like the following image, with the models running one by one in the right order.
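To make this concrete, here is a small sketch of two models where one depends on the other (the model names, the raw_orders table, and the columns are hypothetical):

-- models/stg_orders.sql (upstream model)
SELECT order_id, customer_id, amount
FROM raw_orders

-- models/orders_summary.sql (downstream model; ref() makes the dependency explicit)
SELECT customer_id, SUM(amount) AS total_spent
FROM {{ ref('stg_orders') }}
GROUP BY customer_id

Because orders_summary selects from {{ ref('stg_orders') }}, dbt run will always build stg_orders before orders_summary.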

Image source: docs.getdbt.com

Also, you can tag DBT models. Tags are really helpful when you need to run a specific group of models.

You can add tags to your models in the dbt_project.yml file or in the model SQL file itself.

Here’s an example of how you can tag models in the dbt_project.yml file:

models:
  my_project:
    marketing:
      schema: marketing
      tags: ["daily", "ETL"]
    finance:
      schema: finance
      tags: ["hourly", "ETL"]

In this example, the models under marketing are tagged with daily and ETL, while the models under finance are tagged with hourly and ETL.

You can also tag models directly in the model SQL file using the config block:

{{ config(
    materialized = 'table',
    schema = 'my_schema',
    tags = ['daily', 'ETL']
) }}

Once you’ve tagged your models, you can run a specific group of models by using the --models flag with the tag: prefix. For example, to run all models tagged with daily, you would use the following command:

dbt run --models tag:daily

This command will run all models tagged with daily in the order defined by their dependencies. This is particularly useful when you have a large number of models and want to run a specific subset of them.
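Assuming a reasonably recent version of DBT, you can also combine selectors or exclude tags (the tag names below are the ones from the example above):

# Union: run models tagged daily OR hourly
dbt run --models tag:daily tag:hourly

# Run models tagged daily, excluding those also tagged ETL
dbt run --models tag:daily --exclude tag:ETL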

Tests: DBT has a built-in inspector that checks your data to make sure it’s good quality. It’s like having a super-smart friend who can spot mistakes before they cause problems. This helps you trust your data more and make better decisions.
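For instance, here is a minimal sketch of how tests are typically declared in a .yml file next to your models (the model and column names are hypothetical); running dbt test then checks these assertions against your data:

# models/schema.yml (hypothetical example)
version: 2

models:
  - name: active_customers
    columns:
      - name: customer_id
        tests:
          - unique      # no duplicate customer_id values
          - not_null    # no missing customer_id values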

So, don’t be afraid to dive into DBT and see what wonders it can do for you. Your data adventure awaits!

Now you have an idea of what DBT is and how it can be useful. Next, I’m going to explain the things mentioned above more practically, so you can uncover the exact nature of DBT.

There are 2 versions of DBT:

  1. DBT Core (Open Source): DBT Core, also known as DBT Open Source, is the free and open-source version of DBT. Models are run via its command-line tool. It provides all the essential features for building and managing data transformation pipelines. DBT Core is widely used by data teams of all sizes and is maintained by the open-source community.
  2. DBT Cloud: DBT Cloud is the hosted version of DBT, offered as a Software as a Service (SaaS) platform by dbt Labs (formerly Fishtown Analytics), the company behind DBT. DBT Cloud provides additional features and services on top of DBT Core, including collaboration tools, scheduling, monitoring, and enterprise-grade support. It offers a convenient and scalable solution for teams looking to streamline their data transformation workflows in the cloud.

You can easily set up DBT Core on your computer, or you can just create a free DBT Cloud account. With a free DBT Cloud account you can easily connect to GitHub and maintain your codebase. Also, you can easily connect to whichever database you prefer. The following is the DBT Cloud interface (IDE).

DBT Cloud — Image by author

In the picture above you can see a directory structure. It comes together with DBT. If you have set up DBT Core on your local machine, the directory structure is the same.
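If you want to try DBT Core locally, a minimal setup might look like this (assuming a Postgres warehouse; swap in the adapter package for your own database):

# Install DBT Core plus the adapter for your warehouse
pip install dbt-postgres

# Scaffold a new project with the standard directory structure
dbt init my_project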

DBT projects typically have a specific directory structure that helps organize the codebase. Here’s a high-level overview:

dbt_project/
├── dbt_project.yml
├── analysis/
├── seeds/
├── macros/
├── models/
│   ├── my_new_project/
│   │   ├── example_model.sql
├── snapshots/
├── tests/
└── target/

Let’s break down what each of these directories and files typically contains:

  • dbt_project.yml: This is the main configuration file for your DBT project. It specifies the name of the project, the version of DBT you’re using, database names, schema names and other configuration options (a minimal example follows this list).
  • analysis/: This directory is where you can put any exploratory SQL scripts. Files in this directory won’t be run as part of dbt run.
  • seeds/: This directory is for storing static data that your project uses. This could be CSV files, for example.
  • macros/: This directory is for DBT macros. Macros are reusable pieces of SQL code that you can use across your project.
  • models/: This is where the main SQL transformations are stored. Each file in this directory (or its subdirectories) represents a single DBT model.
  • snapshots/: This directory is for DBT snapshots. Snapshots capture the state of your data at different points in time and are useful for tracking changes over time.
  • tests/: This directory is for data tests, which are assertions about your data that DBT can check for you.
  • target/: This directory is automatically generated when you run your DBT project. It contains the compiled SQL code and the results of your DBT run.
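Here is a minimal sketch of what dbt_project.yml might contain (the project and profile names are hypothetical):

# dbt_project.yml (minimal sketch)
name: my_project
version: '1.0.0'
config-version: 2
profile: my_profile        # must match a profile in your profiles.yml

model-paths: ["models"]

models:
  my_project:
    +materialized: view    # default materialization for all models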

Remember, this is just a typical DBT project structure. Depending on your specific needs and the complexity of your project, your structure might look a bit different. But in general, this structure provides a good starting point for organising your DBT project.

So, basically, you put your SQL scripts inside the models folder. When you create a script, the file name becomes the name of the target model (table/view). Also, we can tell the script what it should build: a table, a view, an incrementally updating table, an ephemeral model, etc. This is called materialization.

In DBT, materializations are methods that define how DBT will transform your SQL models into tables or views in your data warehouse. They control how DBT executes the SQL code in your models and what gets left behind in your database after DBT runs.

There are several types of materializations in DBT:

  1. Table: This materialization creates a table from your SQL model. Each time you run DBT, it will drop the existing table and recreate it with the new data.
  2. View: This materialization creates a view from your SQL model. Each time you run DBT, it will replace the existing view with the new one.
  3. Incremental: This materialization is used for large tables where rebuilding the entire table would be too time-consuming or resource-intensive. It builds a table the first time it runs, and then on subsequent runs, it only adds new records.
  4. Ephemeral: This materialization doesn’t create a table or a view in your database. Instead, it gets compiled into the SQL code of other models that select from it. Ephemeral models can be useful for breaking up complex transformations into smaller pieces.

You can specify the materialization for each model in your DBT project by using the {{ config(materialized='...') }} macro at the top of your model’s SQL file, replacing '...' with the type of materialization you want to use. You can also define materializations in the dbt_project.yml file, but when you define one in the script, it overrides the materialization in the .yml file.

Remember, the choice of materialization depends on your specific use case and the trade-offs you’re willing to make between data freshness, query performance, and resource usage. It’s one of the key decisions you’ll make when designing your DBT project.
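As an illustration, here is a sketch of an incremental model (the model and column names are hypothetical); it uses DBT’s is_incremental() macro and the {{ this }} variable, which refers to the model’s own target table:

-- models/orders_incremental.sql (hypothetical sketch)
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT order_id, customer_id, amount, updated_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only pick up rows newer than what is already loaded
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}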

If you want to reference an existing view/table (not a DBT model) in your database, you would typically define it as a source in a .yml file and then use the source() function to reference it in your SQL scripts.

Firstly, you need to define your source table/view in a .yml file. This is typically done in a file in the models/ directory. Here’s an example:

# models/sources.yml
version: 2

sources:
  - name: my_source
    database: my_database
    schema: my_schema
    tables:
      - name: my_table

In this example, my_source is the name of the source, my_database is the name of your database, my_schema is the name of the schema where your table resides, and my_table is the name of the table.

Once you’ve defined your source table in a .yml file, you can select from it in a .sql script like this:

-- models/example.sql
{{ config(materialized='view') }}

SELECT *
FROM {{ source('my_source', 'my_table') }}

In this example, my_source is the name of the source you defined in your .yml file, and my_table is the name of the table in that source. The {{ source('my_source', 'my_table') }} macro is a special DBT function that gets replaced with the actual database, schema, and table name when DBT runs.

And that’s it! You’ve now defined a source table in a .yml file and selected from it in a .sql script in DBT.

However, if you don’t want to use a .yml file, you can reference the view directly in your SQL scripts using its fully qualified name, like this:

SELECT *
FROM my_database.my_schema.my_view

In this example, my_database is the name of your database, my_schema is the name of the schema where your view resides, and my_view is the name of the view.

Remember, this method bypasses some of the benefits of using DBT sources, such as automatic freshness checks and the ability to swap out sources for testing. But it can be a quick way to get started if you’re just exploring DBT and don’t want to set up a full project structure yet. When you’re ready to start building a production-grade DBT project, it’s recommended to define your sources in a .yml file to take full advantage of DBT’s features. But for quick exploration and prototyping, directly referencing your tables and views can be a useful shortcut.

At this point you might wonder whether you can use the ref() function instead of {{ source() }} to refer to an existing view/table. You can’t, and here’s why.

The ref() function in DBT is used to reference other models within your DBT project, not existing tables or views in your database. When you use ref(), DBT will replace it with the appropriate database, schema, and table name for the model you’re referencing.

-- models/example.sql
{{ config(materialized='view') }}

SELECT *
FROM {{ ref('my_other_model') }}

Let’s say in your DBT repository you have several models. Some of them are materialized as tables and some of them are materialized as views. In this scenario, you need to run the models that are materialized as tables again and again to keep the target tables updated, because the underlying data can change. For the models that are materialized as views this is not mandatory, since they don’t actually store data and serve up-to-date data whenever they are queried. (This can vary according to the underlying data and table structure.)
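If you are on a reasonably recent version of DBT, one convenient way to do this is to select models by their materialization (treat this selector syntax as an assumption about your DBT version):

# Run only the models that are materialized as tables
dbt run --select config.materialized:table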

If you think you have learned something, please follow me to see more articles like this!

Sandun Dayananda

Big Data Engineer with a passion for Machine Learning and DevOps | MSc Industrial Analytics at Uppsala University