dbt — All Basic Concepts in 10 Minutes
Introduction
dbt is a great tool for data transformations that is widely used among data engineers and analytics engineers. It allows data professionals to automate most (if not all) data manipulation tasks in the data warehouse. In this article, I want to provide an overview of the basic concepts of dbt.
For this article, I will use a simple dataset, which is stored as a sample schema in BigQuery. It contains avocado prices for different years and regions; it's not very big and works well for a test example. You can also find this dataset on Kaggle.
You can get started with this tool using dbt Cloud by creating a free developer account; in my experience, this is the easiest and fastest way to begin. I will use dbt in conjunction with BigQuery; you can see in detail how to set up the connection between the two in my previous article.
Note: All code is available on my GitHub. You are more than welcome to check it out and use it in your projects! It also includes the code for more advanced ideas in dbt, so you can find everything there.
Concepts
In this article, we will go over the following important concepts of a dbt project:
- Models
- Seeds
- Sources
- Snapshots
- Tests (generic and singular)
Models
In the dbt Core workflow, models are the foundation. Essentially, a model is a SQL SELECT statement saved to a .sql file. After creating an empty dbt project, you will find a models folder along with other folders that will be used in the following steps.
To start, it's recommended to create separate folders for fact and dimensional tables. This article won't go deep into dimensional modeling, but in brief, smaller dimensional tables are created and joined to the fact table via foreign keys.
The next step is to create the first model. It is a simple one: it selects a few columns from the raw dataset and stores them in a new relation. At this point, all necessary transformations can be performed, so that the downstream fact and dimension tables work with preprocessed values.
WITH raw_avocado AS (
SELECT *
FROM `winter-cocoa-336514.avocado_data.avocado_prices`
)
SELECT
date,
year,
region as origin_region,
AveragePrice as avg_price,
Total_Volume as total_volume,
Total_Bags as total_bags
FROM raw_avocado
After creating and saving the script in a file (mine is named src_avocado.sql; the models below reference it by this name), run the following command to create a new object in your BigQuery project.
dbt run
By default, dbt will materialize this model as a view in your data warehouse, but you can change the materialization to a different form (a sketch for setting folder-level defaults follows the list below). There are four main materializations in dbt:
- View — similar to a CREATE VIEW AS statement in SQL. No data is stored; the data warehouse runs the underlying query each time the view is used. This form can become slow if the view is queried often or the underlying query is heavy.
- Table — the same as a CREATE TABLE AS statement. This is the standard option: it creates a physical table filled with the results of the query.
- Incremental — inserts only new records into an existing table (and can update existing ones if a unique key is configured). This is widely used when raw data is transformed on a regular, for example daily, basis.
- Ephemeral — just code. It is not built into the database at all; it exists only in the dbt project and is interpolated as a CTE into the models that reference it.
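The materialization can be set per model with a config() block, as we will do in a moment, or as a default for a whole folder in dbt_project.yml. Here is a minimal sketch, assuming the project is named avocado_project and the models live in dim and fct subfolders (adjust the names to your own project):
# dbt_project.yml (fragment); project and folder names are illustrative
models:
  avocado_project:
    dim:
      +materialized: table
    fct:
      +materialized: incremental
A config() block inside a model always overrides these folder-level defaults.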
Now, let's create our dimensional tables, dim_year and dim_region (the two scripts below), this time adding a config block to materialize them as tables. Jinja syntax is used heavily in dbt models and other concepts as well. For example, we refer to another model using ref('src_avocado') in the FROM clause. As you work with dbt, you will find this syntax very useful.
{{
config(
materialized='table'
)
}}
WITH src_avocado AS (
SELECT * FROM {{ ref('src_avocado')}}
), t1 AS (
SELECT DISTINCT year
FROM src_avocado
)
SELECT year, CAST(ROW_NUMBER() OVER (ORDER BY year) AS INT64) AS id
FROM t1
{{
config(
materialized='table'
)
}}
WITH src_avocado AS (
SELECT * FROM {{ ref('src_avocado')}}
), t1 AS (
SELECT DISTINCT origin_region
FROM src_avocado
)
SELECT origin_region, CAST(ROW_NUMBER() OVER (ORDER BY origin_region) AS INT64) AS id
FROM t1
This is an example of the incremental materialization, used for the fact model (fct_avocado.sql). It doesn't show much on our dataset, because no new rows arrive, but the idea is to insert only the rows whose date is later than the latest date already stored in the table.
{{
config(
materialized='incremental',
on_schema_change='fail'
)
}}
WITH src_avocado AS (
SELECT *
FROM {{ ref('src_avocado')}}
)
SELECT date, avg_price, total_volume, total_bags, y.id as year_id, r.id as region_id
FROM src_avocado a
JOIN {{ ref('dim_year') }} as y
ON a.year = y.year
JOIN {{ ref('dim_region') }} as r
ON a.origin_region = r.origin_region
{% if is_incremental() %}
WHERE a.date > (SELECT MAX(date) FROM {{ this }})
{% endif %}
When we run dbt run again, the new tables are created, and we can find them in BigQuery.
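One practical note on incremental models: if the logic changes and the table needs to be rebuilt from scratch, dbt provides the --full-refresh flag, which recreates the table instead of appending to it:
dbt run --full-refresh --select fct_avocado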
Seeds
Seeds are CSV files that can be added to a dbt project and loaded into the data warehouse. They are useful when you have data only on your local machine and want to create a new table in your database from a CSV file. To do this, simply add the CSV file to the seeds folder in your project and run the appropriate command.
dbt seed
Once you’ve done it, you can see in the logs that it was successfully uploaded.
The newly created table can also be seen in BigQuery.
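As a quick illustration, imagine a small lookup file seeds/region_codes.csv (a made-up file, not part of the avocado dataset) that maps region names to short codes:
origin_region,region_code
Albany,ALB
Atlanta,ATL
After dbt seed, the file behaves like any other model and can be referenced with ref(), for example to enrich our dimension table:
SELECT r.origin_region, r.id, c.region_code
FROM {{ ref('dim_region') }} AS r
LEFT JOIN {{ ref('region_codes') }} AS c
ON r.origin_region = c.origin_region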
Sources
Sources make your project more flexible: you can name and describe the data loaded into the warehouse. We can create a simple sources.yml file in the models folder:
version: 2

sources:
  - name: avocado
    database: winter-cocoa-336514
    schema: avocado_data
    tables:
      - name: avocado_prices
        identifier: avocado_prices
Here we defined the location of the source: the database (the BigQuery project), the schema, and the table name. Now we can change the code of src_avocado.sql so that it picks up the location of the table from the sources.yml file.
WITH raw_avocado AS (
SELECT *
FROM {{ source('avocado', 'avocado_prices')}}
)
SELECT
date,
year,
region as origin_region,
AveragePrice as avg_price,
Total_Volume as total_volume,
Total_Bags as total_bags
FROM raw_avocado
We may have numerous source tables used in various models, and the location of a source table may change from time to time. With sources, such a change only needs to be made once in the sources.yml file, without editing every .sql file.
To verify that all references in the project resolve correctly, we can compile it. However, we should also double-check the yml file to ensure that everything is written correctly.
dbt compile
Snapshots
Snapshots allow you to keep a history of records as they change over time. You don't always need the full history, but snapshots are a very simple way to preserve it. Essentially, snapshots implement type-2 Slowly Changing Dimensions.
As an example, I create a scd_raw_prices.sql file in the snapshots folder and add the following code:
{% snapshot scd_raw_prices %}
{{
config(
target_schema='dbt_ihorluk',
unique_key='SurrogateKey',
strategy='timestamp',
updated_at='Date',
invalidate_hard_deletes=True
)
}}
SELECT * FROM {{ source('avocado', 'avocado_prices')}}
{% endsnapshot %}
Here I use SurrogateKey as the unique identifier, the timestamp strategy (we follow changes over time), and the Date column to detect changes. To run this snapshot, we use the following command:
dbt snapshot
If everything is correct, we can see changes in the BigQuery schema:
This new table, scd_raw_prices, has four additional columns created by dbt:
- dbt_scd_id — a unique identifier that dbt generates for each snapshotted row.
- dbt_updated_at — the updated_at timestamp of the record when the snapshot was taken.
- dbt_valid_from — when this version of the row became valid. In our case, it equals dbt_updated_at because no changes have happened yet.
- dbt_valid_to — when this version of the row stopped being valid. If the value is null, this row is the current record.
If we change one row and run the snapshot again, we will see that dbt_valid_to is set for the old version and a new row is created:
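These columns also make it easy to work with only the latest version of each record: since dbt_valid_to is null for current rows, a simple filter over the snapshot is enough.
SELECT *
FROM {{ ref('scd_raw_prices') }}
WHERE dbt_valid_to IS NULL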
Tests
There are two main types of tests in dbt:
- Generic tests — built-in checks such as not_null, unique, relationships, and accepted_values applied to columns of the models.
- Singular tests — a SQL SELECT statement that we expect to return no rows. They are useful when values must stay within certain bounds (for example, the price should always be above 0).
Generic tests
When building an automated system, testing is a crucial aspect to consider. Testing has historically been awkward in SQL databases; dbt addresses this by letting users write tests and keep data quality under control. With tests in place, potential issues can be detected early and addressed before they cause significant problems in the data warehouse.
In modern data warehouses, such as BigQuery, maintaining control over data integrity is even more important, especially since there are no enforced primary key constraints like those found in PostgreSQL and other transactional databases. Fortunately, dbt offers generic tests that make it easy to check not-null and unique conditions, among others. To create tests, simply create a schema.yml file in the models folder and begin writing tests. Some of the tests that I've created include:
- For dim_region, we ensure that the id column is unique and has no missing values, since it serves as the primary key of this model. Additionally, we check its relationship to region_id in fct_avocado, where it serves as a foreign key.
- For dim_year, we only have values for the years 2015–2018. By using the accepted_values test, we can easily verify that no values outside of this range are present.
version: 2

models:
  - name: dim_region
    columns:
      - name: id
        tests:
          - unique
          - not_null
          - relationships:
              field: region_id
              to: ref('fct_avocado')

  - name: dim_year
    columns:
      - name: year
        tests:
          - accepted_values:
              values: ['2015', '2016', '2017', '2018']
              quote: false
After this file is created, we can run the tests with a single command:
dbt test
If everything is correct in the database, we will see that all tests pass. If any problems occur, we can start debugging and look for the root cause. The compiled test queries can be found in the folder target/compiled/{project_name}/models/schema.yml/.
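For intuition, the compiled SQL of a generic test is simply a query that must return zero rows. The unique test on dim_region.id, for example, compiles to something roughly like the query below (the exact SQL depends on your dbt version):
SELECT id
FROM {{ ref('dim_region') }}
WHERE id IS NOT NULL
GROUP BY id
HAVING COUNT(*) > 1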
Singular tests
Singular tests are even easier to implement: just create a new .sql file in the tests folder and write a SELECT statement that should return no rows. For our dataset, I will check avg_price from the fct_avocado model, which must always be greater than 0. This is our test:
SELECT *
FROM {{ ref('fct_avocado')}}
WHERE avg_price <= 0
LIMIT 10
When we run the dbt test command again, we will see one more test pass if everything is correct.
If you only need to test one particular model, you can run the command with the --select option:
dbt test --select fct_avocado
Where to go from here
With a solid understanding of the fundamental concepts of dbt, you can begin to transform your data warehouse. As demonstrated in this article, the transformations themselves are written in SQL, but dbt streamlines the process by automating it. There are more complex concepts to explore, and I will delve into those in future articles. For now, armed with this knowledge, you can start working with dbt. To deepen your proficiency with the tool, here are some directions to look into:
- Jinja templating — dbt relies on Jinja heavily, and diving deeper into it gives you much more freedom in your solutions.
- Macros — reusable pieces of code. They are quite handy, for example, when several models or tests repeat the same logic (see the sketch after this list).
- Packages — extra functionality pulled in from dbt packages, similar to libraries in Python.
- Docs — you can generate documentation for your models and the whole project. It is very useful when working in a team.
- Analyses — if you don't need a model but just a SQL query for some analytical goal, you can store and version-control it in the analyses folder.
- Hooks — SQL that runs at defined points of a dbt run, useful for database administration tasks.
- Exposures — declare your BI dashboards (or other downstream uses of the project) so dbt can track them.
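To give a flavour of macros: both dimension models above build their surrogate id with the same ROW_NUMBER() expression, so that snippet could live in a single macro. A minimal sketch (the macro name and file are my own, not from the project):
-- macros/surrogate_id.sql (hypothetical file)
{% macro surrogate_id(order_column) %}
    CAST(ROW_NUMBER() OVER (ORDER BY {{ order_column }}) AS INT64)
{% endmacro %}
A model can then call {{ surrogate_id('year') }} instead of repeating the expression in every dimension.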
Conclusions
As mentioned earlier, this article provides an overview of all the fundamental concepts. The best way to explore this data engineering tool is by trying it out yourself on real-world datasets. With just a quick read, you can start transforming your data warehouse right away. I highly recommend preparing your data for external use, such as connecting it to a BI tool (which can be easily done with Superset) and creating visualizations. This workflow is a scaled-down version of what's used in large corporations.
Once you have gained enough knowledge and practice, you can dive deeper into the documentation (use some recommendations from the previous section) and try out new features. I’m confident that within a few months, you’ll have a comprehensive understanding of this important tool.
You can find me on LinkedIn, and I am looking forward to connecting with you. Let's discuss the intricate world of data science and data engineering.