Data Transformation on TiDB Made Easier

In this tutorial, I will show you how to use dbt with TiDB.

  • TiDB 5.3 or later
  • dbt 1.01 or later
  • dbt-tidb 1.0.0

Installation

There are several ways you can install dbt and dbt-tidb, In this tutorial, we will use pypi. When you install dbt-tidb, dbt is installed as a dependency. So you only need one command to install both:

$ pip install dbt-tidb

Creating the project: jaffle shop

dbt-lab provides a project, jaffle_shop, to demonstrate dbt’s functionality. You can get the project directly from GitHub:

$ git clone https://github.com/dbt-labs/jaffle_shop
$ cd jaffle_shop
ubuntu@ubuntu:~/jaffle_shop$ tree
.
├── dbt_project.yml
├── etc
│ ├── dbdiagram_definition.txt
│ └── jaffle_shop_erd.png
├── LICENSE
├── models
│ ├── customers.sql
│ ├── docs.md
│ ├── orders.sql
│ ├── overview.md
│ ├── schema.yml
│ └── staging
│ ├── schema.yml
│ ├── stg_customers.sql
│ ├── stg_orders.sql
│ └── stg_payments.sql
├── README.md
└── seeds
├── raw_customers.csv
├── raw_orders.csv
└── raw_payments.csv
  • dbt_project.yml is the dbt project configuration file, which holds the project name and database configuration file information.
  • The models directory contains the project’s SQL models and table schemas. Note that the data analyst at your company writes this section. To learn more about models, see dbt Docs.
  • The seed directory stores CSV files that are dumped from database export tools. For example, TiDB can export the table data into CSV files through Dumpling. In the jaffle shop project, these CSV files are used as raw data to be processed.

Configuring the project

To configure the project:

  1. Complete the global configuration. In the user directory, edit the default global profile, ~/.dbt/profiles.yml to configure the connection with TiDB:
$ vi ~/.dbt/profiles.yml
jaffle_shop_tidb: # project name
target: dev # target
outputs:
dev:
type: tidb # adapter type
server: 127.0.0.1
port: 4000
schema: analytics # database name
username: root
password: ""
$ cat dbt_project.yml
name: 'jaffle_shop'

config-version: 2
version: '0.1'

profile: 'jaffle_shop_tidb' # note the modification here

model-paths: ["models"] # model path
seed-paths: ["seeds"] # seed path
test-paths: ["tests"]
analysis-paths: ["analysis"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
- "target"
- "dbt_modules"
- "logs"

require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
jaffle_shop:
materialized: table # *.sql which in models/ would be materialized to table
staging:
materialized: view # *.sql which in models/staging/ would bt materialized to view
$ dbt debug​​

Loading CSV files

Now that you have successfully created and configured the project, it’s time to load the CSV data and materialize the CSV as a table in the target database. Note that this step is not generally required for a dbt project because the data items for processing are already in the database.

  1. Load the CSV files by running the following command:
$ dbt seed
Running with dbt=1.0.1
Partial parse save file not found. Starting full parse.
Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
Concurrency: 1 threads (target='dev')
1 of 3 START seed file analytics.raw_customers.................................. [RUN]
1 of 3 OK loaded seed file analytics.raw_customers.............................. [INSERT 100 in 0.19s]
2 of 3 START seed file analytics.raw_orders..................................... [RUN]
2 of 3 OK loaded seed file analytics.raw_orders................................. [INSERT 99 in 0.14s]
3 of 3 START seed file analytics.raw_payments................................... [RUN]
3 of 3 OK loaded seed file analytics.raw_payments............................... [INSERT 113 in 0.24s]
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA |
| PERFORMANCE_SCHEMA |
| analytics |
| mysql |
| test |
+--------------------+
6 rows in set (0.00 sec)

mysql> show tables;
+---------------------+
| Tables_in_analytics |
+---------------------+
| raw_customers |
| raw_orders |
| raw_payments |
+---------------------+
3 rows in set (0.00 sec)

Running the dbt project

Now you are ready to run the configured projects and finish the data transformation.

  1. Run the dbt project to finish the data transformation:
$ dbt runRunning with dbt=1.0.1
Unable to do partial parsing because profile has changed
Unable to do partial parsing because a project dependency has been added
Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
Concurrency: 1 threads (target='dev')
1 of 5 START view model analytics.stg_customers................................. [RUN]
1 of 5 OK created view model analytics.stg_customers............................ [SUCCESS 0 in 0.12s]
2 of 5 START view model analytics.stg_orders.................................... [RUN]
2 of 5 OK created view model analytics.stg_orders............................... [SUCCESS 0 in 0.08s]
3 of 5 START view model analytics.stg_payments.................................. [RUN]
3 of 5 OK created view model analytics.stg_payments............................. [SUCCESS 0 in 0.07s]
4 of 5 START table model analytics.customers.................................... [RUN]
4 of 5 OK created table model analytics.customers............................... [SUCCESS 0 in 0.16s]
5 of 5 START table model analytics.orders....................................... [RUN]
5 of 5 OK created table model analytics.orders.................................. [SUCCESS 0 in 0.12s]
mysql> show tables;
+---------------------+
| Tables_in_analytics |
+---------------------+
| customers |
| orders |
| raw_customers |
| raw_orders |
| raw_payments |
| stg_customers |
| stg_orders. |
| stg_payments |
+---------------------+
8 rows in set (0.00 sec)

mysql> select * from customers;
+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
| customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value |
+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
| 1 | Michael | P. | 2018-01-01 | 2018-02-10 | 2 | 33.0000 |
| 2 | Shawn | M. | 2018-01-11 | 2018-01-11 | 1 | 23.0000 |
| 3 | Kathleen | P. | 2018-01-02 | 2018-03-11 | 3 | 65.0000 |
| 4 | Jimmy | C. | NULL | NULL | NULL | NULL |
| 5 | Katherine | R. | NULL | NULL | NULL | NULL |
| 6 | Sarah | R. | 2018-02-19 | 2018-02-19 | 1 | 8.0000 |
| 7 | Martin | M. | 2018-01-14 | 2018-01-14 | 1 | 26.0000 |
| 8 | Frank | R. | 2018-01-29 | 2018-03-12 | 2 | 45.0000 |

Generating visual documents

dbt lets you generate visual documents that display the overall structure of the project and describe all the tables and views. To generate visual documents:

  1. Generate the document:
$ dbt docs generate
$ dbt docs serve 
Running with dbt=1.0.1
Serving docs at 0.0.0.0:8080

Conclusion

Currently, TiDB supports dbt in TiDB 4.0 and later versions. Earlier versions of TiDB may run into issues when working with dbt. For details, visit the tidb-dbt project on GitHub. To get the most out of dbt, we recommend that you run TiDB 5.3 or later. These versions support all of dbt’s functions.

--

--

PingCAP is the team behind TiDB, an open-source MySQL compatible NewSQL database. Official website: https://pingcap.com/ GitHub: https://github.com/pingcap

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
PingCAP

PingCAP is the team behind TiDB, an open-source MySQL compatible NewSQL database. Official website: https://pingcap.com/ GitHub: https://github.com/pingcap