The Power of Data Build Tool (dbt)


Welcome to the learning edition of the Data Pragmatist, your dose of all things data science and AI.

📖 Estimated Reading Time: 5 minutes. Missed our previous editions?

🏃‍♂️ AI detects doping with minimal data LINK

  • Scientists at Saarland University developed AI software to detect doping in top athletes using minimal data, promising to streamline doping checks at events like the Olympics.

  • This AI system requires only three urine samples over an athlete's career to accurately identify doping by analyzing seven specific steroid-related characteristics in each sample.

  • The software identifies unusual patterns in new urine samples, achieving 99% certainty in detecting non-doped athletes, while questionable cases undergo further manual DNA testing for confirmation.

🏠 Largest 3D-printed neighborhood nears completion LINK

  • A massive 3D printer, known as Vulcan, is constructing 100 homes in Georgetown, Texas, making the development the largest 3D-printed neighborhood in the world.

  • ICON Technology, a Texas-based company, developed the nearly 5-ton, 45-foot-wide printer, which builds homes faster, cheaper, and with less waste than traditional methods.

  • This construction method reduces the need for multiple crews, and industry experts warn it could make some construction jobs obsolete, raising concerns about its impact on the labor force.

🧠 The Power of Data Build Tool (dbt)

Data analytics has seen significant evolution in recent years, with numerous tools emerging to enhance workflows and deliver better outcomes. Since 2018, I've been extensively using the Data Build Tool (dbt) to build data models in various data warehouses across multiple industries.
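
For readers who have not used it, a dbt model is simply a SQL SELECT statement saved in a file; dbt compiles it, materializes it in the warehouse, and wires dependencies between models through the ref() and source() functions. A minimal sketch, with placeholder file, source, and column names:

    -- models/staging/stg_customers.sql (hypothetical staging model)
    -- dbt materializes this select as a view or table depending on its config
    select
        id as customer_id,
        first_name,
        last_name,
        created_at
    from {{ source('raw', 'customers') }}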

Key Features of dbt

dbt is a powerful tool for data transformation, offering features that streamline the development and management of data models. It focuses solely on the transformation step of the data pipeline (the "T" in ELT) and is not an extraction or loading tool. Below are some key features that make dbt an essential tool for Analytics Engineers.

Model Testing for Data Integrity

One of dbt’s most valuable features is its ability to test data models for integrity. It integrates testing directly into the data transformation process, ensuring data quality issues are identified early. Some of the default tests include:

  • unique: Ensures all values in a column are unique.

  • not_null: Checks for NULL values in a column.

  • accepted_values: Verifies that values conform to a specified list.

  • relationships: Ensures that values in a column exist in another model's column (useful for foreign keys).
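
As an illustration, here is a minimal schema.yml sketch that wires all four built-in tests to a hypothetical staging model (the model and column names are placeholders):

    # models/staging/schema.yml -- hypothetical model and column names
    version: 2

    models:
      - name: stg_orders
        columns:
          - name: order_id
            tests:            # dbt 1.8+ also accepts the key data_tests
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'returned']
          - name: customer_id
            tests:
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id

Running dbt test compiles each of these into a query that returns failing rows; zero rows back means the test passes.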

As of version 1.8, dbt also supports unit tests, which validate a model's SQL against small, mocked inputs. This feature is particularly useful for verifying the logic of a data model before it runs against real data.
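
A unit test is declared in YAML next to the model, with mocked input rows and the rows you expect the model's SQL to produce. A rough sketch, again with hypothetical model and column names:

    unit_tests:
      - name: test_high_value_flag
        model: fct_orders
        given:
          - input: ref('stg_orders')
            rows:
              - {order_id: 1, amount: 1200}
              - {order_id: 2, amount: 40}
        expect:
          rows:
            - {order_id: 1, is_high_value: true}
            - {order_id: 2, is_high_value: false}

Because the model runs only against the mocked rows, the test is fast and deterministic and never touches production data.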

Types of Models in dbt

dbt supports various model types, each serving specific purposes within the data pipeline:

  • Snapshots: Capture the state of mutable data each time they run, useful for tracking how records change over time.

  • Tables: Materialize query results as persistent tables in the database.

  • Views: Create virtual tables that provide a dynamic view of data.

  • Incremental Models: Process and store only new or changed data since the last run, optimizing performance and storage.
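
Incremental models are the least intuitive of the four, so here is a minimal sketch with placeholder names: the is_incremental() block is ignored on the first run, which builds the full table, and applied on every later run so only new or changed rows are processed.

    -- models/fct_orders.sql (hypothetical incremental model)
    {{ config(materialized='incremental', unique_key='order_id') }}

    select
        order_id,
        customer_id,
        amount,
        updated_at
    from {{ ref('stg_orders') }}

    {% if is_incremental() %}
    -- only pick up rows that changed since the last successful run
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}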

Technical Guide: Building a dbt Project with DuckDB

This section provides a step-by-step guide to building a dbt project with DuckDB: setting up the environment, creating the project, defining sources and models, and running tests. A condensed sketch of the key project files and commands follows the steps below.

  • Step 1: Setting Up the Environment: Begin by installing dbt and DuckDB. This can be done by creating a virtual Python environment and installing the necessary packages with the command pip install dbt-core dbt-duckdb.

  • Step 2: Creating the dbt Project: Initialize the dbt project using the command dbt init. You will be guided to name your project and select the database (in this case, DuckDB).

  • Step 3: Defining Sources and Models: Define your source data in dbt to pull from CSV files. For mutable sources, create snapshots within the snapshots folder. After defining snapshots, create staging views and then define fact and dimension models within the models folder.

  • Step 4: Running Tests and Building Documentation: Run dbt tests to ensure the integrity of your models and generate the documentation website to get an overview of your data pipeline. This process helps to verify that everything is functioning as expected and that the models are accurately built.

  • Step 5: Reviewing the Results: Examine the output of your dbt run to ensure that the models and documentation are as expected. DuckDB can be used to query the final dataset directly, allowing you to review the results and verify the success of the transformation process.
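
To make Steps 2 and 3 concrete, here is a condensed sketch of the files involved. Every name below (the profile, the source, the models) is a placeholder, and the CSV-reading approach uses dbt-duckdb's external_location source setting, so check the adapter documentation for the version you install:

    # profiles.yml (dbt init writes this to ~/.dbt/ by default)
    my_dbt_project:
      target: dev
      outputs:
        dev:
          type: duckdb
          path: dev.duckdb

    # models/staging/sources.yml -- point dbt at raw CSV files
    version: 2
    sources:
      - name: raw
        meta:
          external_location: "data/{name}.csv"   # dbt-duckdb reads each CSV directly
        tables:
          - name: orders

    -- snapshots/orders_snapshot.sql -- track changes in a mutable source
    {% snapshot orders_snapshot %}
    {{ config(target_schema='snapshots', unique_key='order_id',
              strategy='timestamp', updated_at='updated_at') }}
    select * from {{ source('raw', 'orders') }}
    {% endsnapshot %}

    -- models/staging/stg_orders.sql -- staging view over the snapshot
    {{ config(materialized='view') }}
    select order_id, customer_id, amount, updated_at
    from {{ ref('orders_snapshot') }}
    where dbt_valid_to is null   -- keep only the current version of each record

Fact and dimension models then select from the staging views in the same way. The end-to-end command sequence for Steps 1, 2, 4, and 5 looks roughly like this (the dev.duckdb file name should match whatever your profiles.yml points at):

    # Step 1: environment
    python -m venv .venv && source .venv/bin/activate
    pip install dbt-core dbt-duckdb

    # Step 2: scaffold the project and choose duckdb when prompted
    dbt init my_dbt_project
    cd my_dbt_project

    # Step 3: build snapshots, then staging/fact/dimension models
    dbt snapshot
    dbt run

    # Step 4: run tests and build the documentation site
    dbt test
    dbt docs generate    # then: dbt docs serve to browse it locally

    # Step 5: query the results directly with the duckdb Python package
    # (installed as a dependency of dbt-duckdb)
    python -c "import duckdb; print(duckdb.connect('dev.duckdb').sql('select * from stg_orders limit 5'))"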

Conclusion

dbt has significantly transformed the way data models are built and managed, offering powerful features like model testing, data contracts, and automatic documentation.

Top Data Engineering Tools

1. Snowflake

  • Functionality: Cloud data platform offering data warehousing, data lakes, and data engineering capabilities.

  • Key Features: Performance, scale, elasticity, and independent scaling of data workloads.

  • Use Cases: Data warehousing, data lakes, data science, and developing data applications.

2. dbt (Data Build Tool)

  • Functionality: Command-line tool for transforming data in warehouses using SQL.

  • Key Features: Simplifies data transformation workflows, integrates seamlessly with data warehouses.

  • Use Cases: Data transformation, orchestration in analytics workflows.

3. Secoda

  • Functionality: Data discovery and observability platform.

  • Key Features: Consolidates data catalog, lineage, monitoring, and documentation into a single platform. Offers insights into pipeline metadata such as cost, query volume, and popularity.

  • Use Cases: Data cataloging, lineage tracking, pipeline monitoring, and optimizing data infrastructure.


If you are interested in contributing to the newsletter, respond to this email. We are looking for contributions from you, our readers, to keep the community alive and going.