[TiDB] 数据加载

首先

我使用TiDB创建了一个数据库,并使用LOAD DATA命令加载了示例数据。我按照快速入门指南中的导入示例数据库的步骤进行了操作。

验证行动环境

本文所述的步骤已在以下环境中进行了操作验证。

[root@tisim ~]# tiup cluster display demo-cluster
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.12.1/tiup-cluster display demo-cluster
Cluster type:       tidb
Cluster name:       demo-cluster
Cluster version:    v7.0.0
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://192.168.3.171:2379/dashboard
Grafana URL:        http://192.168.3.171:3000
ID                   Role        Host           Ports                            OS/Arch       Status   Data Dir                    Deploy Dir
--                   ----        ----           -----                            -------       ------   --------                    ----------
192.168.3.171:3000   grafana     192.168.3.171  3000                             linux/x86_64  Up       -                           /tidb-deploy/grafana-3000
192.168.3.171:2379   pd          192.168.3.171  2379/2380                        linux/x86_64  Up|L|UI  /tidb-data/pd-2379          /tidb-deploy/pd-2379
192.168.3.171:9090   prometheus  192.168.3.171  9090/12020                       linux/x86_64  Up       /tidb-data/prometheus-9090  /tidb-deploy/prometheus-9090
192.168.3.171:4000   tidb        192.168.3.171  4000/10080                       linux/x86_64  Up       -                           /tidb-deploy/tidb-4000
192.168.3.171:9000   tiflash     192.168.3.171  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /tidb-data/tiflash-9000     /tidb-deploy/tiflash-9000
192.168.3.171:20160  tikv        192.168.3.171  20160/20180                      linux/x86_64  Up       /tidb-data/tikv-20160       /tidb-deploy/tikv-20160
192.168.3.171:20161  tikv        192.168.3.171  20161/20181                      linux/x86_64  Up       /tidb-data/tikv-20161       /tidb-deploy/tikv-20161
192.168.3.171:20162  tikv        192.168.3.171  20162/20182                      linux/x86_64  Up       /tidb-data/tikv-20162       /tidb-deploy/tikv-20162
Total nodes: 8

数据文件下载

从2010年到2017年,下载并解压缩ZIP文件。CSV文件的大小大约为3GB左右。

[root@tisim ~]# mkdir -p bikeshare-data && cd bikeshare-data
[root@tisim bikeshare-data]# curl -L --remote-name-all https://s3.amazonaws.com/capitalbikeshare-data/{2010..2017}-capitalbikeshare-tripdata.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2351k  100 2351k    0     0  1032k      0  0:00:02  0:00:02 --:--:-- 1032k
100 24.1M  100 24.1M    0     0  5235k      0  0:00:04  0:00:04 --:--:-- 5803k
100 41.4M  100 41.4M    0     0  3688k      0  0:00:11  0:00:11 --:--:-- 4008k
100 53.8M  100 53.8M    0     0  3211k      0  0:00:17  0:00:17 --:--:-- 4242k
100 63.4M  100 63.4M    0     0  5043k      0  0:00:12  0:00:12 --:--:-- 5174k
100 70.1M  100 70.1M    0     0  4473k      0  0:00:16  0:00:16 --:--:-- 4453k
100 74.6M  100 74.6M    0     0  4480k      0  0:00:17  0:00:17 --:--:-- 5672k
100 85.4M  100 85.4M    0     0  2377k      0  0:00:36  0:00:36 --:--:-- 1722k

[root@tisim bikeshare-data]# unzip \*-tripdata.zip
Archive:  2010-capitalbikeshare-tripdata.zip
  inflating: 2010-capitalbikeshare-tripdata.csv
(省略)
Archive:  2017-capitalbikeshare-tripdata.zip
  inflating: 2017Q1-capitalbikeshare-tripdata.csv
  inflating: 2017Q2-capitalbikeshare-tripdata.csv
  inflating: 2017Q3-capitalbikeshare-tripdata.csv
  inflating: 2017Q4-capitalbikeshare-tripdata.csv

8 archives were successfully processed.

创建数据库

使用mysql客户端连接到TiDB,并创建数据库。

[root@tisim bikeshare-data]# mysql -h 192.168.3.171 -P 4000 -uroot
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 409
Server version: 5.7.25-TiDB-v7.0.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 5.7 compatible

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE bikeshare;
Query OK, 0 rows affected (0.10 sec)

选择已创建的bikeshare数据库。

mysql> USE bikeshare
Database changed

创建一张表格。

mysql> CREATE TABLE trips (
    ->  trip_id bigint NOT NULL PRIMARY KEY AUTO_INCREMENT,
    ->  duration integer not null,
    ->  start_date datetime,
    ->  end_date datetime,
    ->  start_station_number integer,
    ->  start_station varchar(255),
    ->  end_station_number integer,
    ->  end_station varchar(255),
    ->  bike_number varchar(255),
    ->  member_type varchar(255)
    -> );
Query OK, 0 rows affected (0.09 sec)

使用LOAD DATA LOCAL语句进行加载。尝试只加载一个文件,但出现了错误。

mysql> SET tidb_dml_batch_size = 20000;
Query OK, 0 rows affected (0.00 sec)

mysql> LOAD DATA LOCAL INFILE '2017Q1-capitalbikeshare-tripdata.csv' INTO TABLE trips
    ->   FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    ->   LINES TERMINATED BY '\r\n'
    ->   IGNORE 1 LINES
    -> (duration, start_date, end_date, start_station_number, start_station,
    -> end_station_number, end_station, bike_number, member_type);
ERROR 2068 (HY000): LOAD DATA LOCAL INFILE file request rejected due to restrictions on access.

mysql> SELECT @@local_infile ;
+----------------+
| @@local_infile |
+----------------+
|              1 |
+----------------+
1 row in set (0.00 sec)

mysql> exit
Bye

开启mysql客户端的local_infile,并连接到bikeshare。

[root@tisim bikeshare-data]# mysql -h 192.168.3.171 -P 4000 -uroot bikeshare --local-infile=1
mysql> SELECT DATABASE();
+------------+
| DATABASE() |
+------------+
| bikeshare  |
+------------+
1 row in set (0.00 sec)

再次执行LOAD DATA LOCAL的命令。这次成功了。

mysql> SET tidb_dml_batch_size = 20000;
Query OK, 0 rows affected (0.00 sec)

mysql> LOAD DATA LOCAL INFILE '2017Q1-capitalbikeshare-tripdata.csv' INTO TABLE trips
    ->   FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    ->   LINES TERMINATED BY '\r\n'
    ->   IGNORE 1 LINES
    -> (duration, start_date, end_date, start_station_number, start_station,
    -> end_station_number, end_station, bike_number, member_type);
Query OK, 646510 rows affected (17.02 sec)
Records: 646510  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT COUNT(*) FROM trips;
+----------+
| COUNT(*) |
+----------+
|   646510 |
+----------+
1 row in set (0.21 sec)

一旦执行TRUNCATE操作。

mysql> TRUNCATE TABLE trips;
Query OK, 0 rows affected (0.10 sec)

mysql> SELECT COUNT(*) FROM trips;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql> exit
Bye

加载所有的CSV文件。

[root@tisim bikeshare-data]# for FILE in *.csv; do
>  echo "== $FILE =="
> mysql -h 192.168.3.171 -P 4000 -uroot bikeshare --local-infile=1 -e "SET tidb_dml_batch_size = 20000; LOAD DATA LOCAL INFILE '${FILE}' INTO TABLE trips FIELDS TERMINATED BY ',' ENCLOSED BY '\
"' LINES TERMINATED BY '\r\n' IGNORE 1 LINES (duration, start_date, end_date, start_station_number, start_station, end_station_number, end_station, bike_number, member_type);"
> done;
== 2010-capitalbikeshare-tripdata.csv ==
== 2011-capitalbikeshare-tripdata.csv ==
== 2012Q1-capitalbikeshare-tripdata.csv ==
== 2012Q2-capitalbikeshare-tripdata.csv ==
== 2012Q3-capitalbikeshare-tripdata.csv ==
== 2012Q4-capitalbikeshare-tripdata.csv ==
== 2013Q1-capitalbikeshare-tripdata.csv ==
== 2013Q2-capitalbikeshare-tripdata.csv ==
== 2013Q3-capitalbikeshare-tripdata.csv ==
== 2013Q4-capitalbikeshare-tripdata.csv ==
== 2014Q1-capitalbikeshare-tripdata.csv ==
== 2014Q2-capitalbikeshare-tripdata.csv ==
== 2014Q3-capitalbikeshare-tripdata.csv ==
== 2014Q4-capitalbikeshare-tripdata.csv ==
== 2015Q1-capitalbikeshare-tripdata.csv ==
== 2015Q2-capitalbikeshare-tripdata.csv ==
== 2015Q3-capitalbikeshare-tripdata.csv ==
== 2015Q4-capitalbikeshare-tripdata.csv ==
== 2016Q1-capitalbikeshare-tripdata.csv ==
== 2016Q2-capitalbikeshare-tripdata.csv ==
== 2016Q3-capitalbikeshare-tripdata.csv ==
== 2016Q4-capitalbikeshare-tripdata.csv ==
== 2017Q1-capitalbikeshare-tripdata.csv ==
== 2017Q2-capitalbikeshare-tripdata.csv ==
== 2017Q3-capitalbikeshare-tripdata.csv ==
== 2017Q4-capitalbikeshare-tripdata.csv ==

我要确认一下已加载的行数。

mysql> SELECT COUNT(*) FROM trips;
+----------+
| COUNT(*) |
+----------+
| 19117643 |
+----------+
1 row in set (3.05 sec)

我将确认执行计划。

    • cop[tikv]のtaskがTiKVへオフロードしている処理を表しており、TiKVが、bike_number = ‘W00742’に一致する行を取得して行数を集計していることが確認できます。

 

    rootのtaskがTiDB側の処理を表しており、TiDBからの結果を集約していることが確認できます。
mysql> EXPLAIN SELECT COUNT(*) FROM trips WHERE bike_number = 'W00742';
+------------------------------+-------------+-----------+---------------+-------------------------------------------+
| id                           | estRows     | task      | access object | operator info                             |
+------------------------------+-------------+-----------+---------------+-------------------------------------------+
| HashAgg_13                   | 1.00        | root      |               | funcs:count(Column#12)->Column#11         |
| └─TableReader_14             | 1.00        | root      |               | data:HashAgg_6                            |
|   └─HashAgg_6                | 1.00        | cop[tikv] |               | funcs:count(1)->Column#12                 |
|     └─Selection_12           | 9386.78     | cop[tikv] |               | eq(bikeshare.trips.bike_number, "W00742") |
|       └─TableFullScan_11     | 19938837.00 | cop[tikv] | table:trips   | keep order:false                          |
+------------------------------+-------------+-----------+---------------+-------------------------------------------+
5 rows in set (0.01 sec)

总结而言

在执行数据加载到mysql服务器上的同时,我可以使用与之相同的方式来执行。在确认加载后的执行计划时,我理解到了TiKV将处理工作卸载的事实。只有经过TiKV端过滤的数据才会被传输到TiDB端,这样可以期望减少TiDB端的CPU资源和数据传输量。在像商业环境中TiDB节点和TiKV节点分别在不同服务器上运行的环境中,可以受益匪浅。

广告
将在 10 秒后关闭
bannerAds