[TiDB] 数据加载
首先
我使用TiDB创建了一个数据库,并使用LOAD DATA命令加载了示例数据。我按照快速入门指南中的导入示例数据库的步骤进行了操作。
验证行动环境
本文所述的步骤已在以下环境中进行了操作验证。
[root@tisim ~]# tiup cluster display demo-cluster
tiup is checking updates for component cluster ...
Starting component `cluster`: /root/.tiup/components/cluster/v1.12.1/tiup-cluster display demo-cluster
Cluster type: tidb
Cluster name: demo-cluster
Cluster version: v7.0.0
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://192.168.3.171:2379/dashboard
Grafana URL: http://192.168.3.171:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
192.168.3.171:3000 grafana 192.168.3.171 3000 linux/x86_64 Up - /tidb-deploy/grafana-3000
192.168.3.171:2379 pd 192.168.3.171 2379/2380 linux/x86_64 Up|L|UI /tidb-data/pd-2379 /tidb-deploy/pd-2379
192.168.3.171:9090 prometheus 192.168.3.171 9090/12020 linux/x86_64 Up /tidb-data/prometheus-9090 /tidb-deploy/prometheus-9090
192.168.3.171:4000 tidb 192.168.3.171 4000/10080 linux/x86_64 Up - /tidb-deploy/tidb-4000
192.168.3.171:9000 tiflash 192.168.3.171 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /tidb-data/tiflash-9000 /tidb-deploy/tiflash-9000
192.168.3.171:20160 tikv 192.168.3.171 20160/20180 linux/x86_64 Up /tidb-data/tikv-20160 /tidb-deploy/tikv-20160
192.168.3.171:20161 tikv 192.168.3.171 20161/20181 linux/x86_64 Up /tidb-data/tikv-20161 /tidb-deploy/tikv-20161
192.168.3.171:20162 tikv 192.168.3.171 20162/20182 linux/x86_64 Up /tidb-data/tikv-20162 /tidb-deploy/tikv-20162
Total nodes: 8
数据文件下载
从2010年到2017年,下载并解压缩ZIP文件。CSV文件的大小大约为3GB左右。
[root@tisim ~]# mkdir -p bikeshare-data && cd bikeshare-data
[root@tisim bikeshare-data]# curl -L --remote-name-all https://s3.amazonaws.com/capitalbikeshare-data/{2010..2017}-capitalbikeshare-tripdata.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2351k 100 2351k 0 0 1032k 0 0:00:02 0:00:02 --:--:-- 1032k
100 24.1M 100 24.1M 0 0 5235k 0 0:00:04 0:00:04 --:--:-- 5803k
100 41.4M 100 41.4M 0 0 3688k 0 0:00:11 0:00:11 --:--:-- 4008k
100 53.8M 100 53.8M 0 0 3211k 0 0:00:17 0:00:17 --:--:-- 4242k
100 63.4M 100 63.4M 0 0 5043k 0 0:00:12 0:00:12 --:--:-- 5174k
100 70.1M 100 70.1M 0 0 4473k 0 0:00:16 0:00:16 --:--:-- 4453k
100 74.6M 100 74.6M 0 0 4480k 0 0:00:17 0:00:17 --:--:-- 5672k
100 85.4M 100 85.4M 0 0 2377k 0 0:00:36 0:00:36 --:--:-- 1722k
[root@tisim bikeshare-data]# unzip \*-tripdata.zip
Archive: 2010-capitalbikeshare-tripdata.zip
inflating: 2010-capitalbikeshare-tripdata.csv
(省略)
Archive: 2017-capitalbikeshare-tripdata.zip
inflating: 2017Q1-capitalbikeshare-tripdata.csv
inflating: 2017Q2-capitalbikeshare-tripdata.csv
inflating: 2017Q3-capitalbikeshare-tripdata.csv
inflating: 2017Q4-capitalbikeshare-tripdata.csv
8 archives were successfully processed.
创建数据库
使用mysql客户端连接到TiDB,并创建数据库。
[root@tisim bikeshare-data]# mysql -h 192.168.3.171 -P 4000 -uroot
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 409
Server version: 5.7.25-TiDB-v7.0.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 5.7 compatible
Copyright (c) 2000, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE DATABASE bikeshare;
Query OK, 0 rows affected (0.10 sec)
选择已创建的bikeshare数据库。
mysql> USE bikeshare
Database changed
创建一张表格。
mysql> CREATE TABLE trips (
-> trip_id bigint NOT NULL PRIMARY KEY AUTO_INCREMENT,
-> duration integer not null,
-> start_date datetime,
-> end_date datetime,
-> start_station_number integer,
-> start_station varchar(255),
-> end_station_number integer,
-> end_station varchar(255),
-> bike_number varchar(255),
-> member_type varchar(255)
-> );
Query OK, 0 rows affected (0.09 sec)
使用LOAD DATA LOCAL语句进行加载。尝试只加载一个文件,但出现了错误。
mysql> SET tidb_dml_batch_size = 20000;
Query OK, 0 rows affected (0.00 sec)
mysql> LOAD DATA LOCAL INFILE '2017Q1-capitalbikeshare-tripdata.csv' INTO TABLE trips
-> FIELDS TERMINATED BY ',' ENCLOSED BY '"'
-> LINES TERMINATED BY '\r\n'
-> IGNORE 1 LINES
-> (duration, start_date, end_date, start_station_number, start_station,
-> end_station_number, end_station, bike_number, member_type);
ERROR 2068 (HY000): LOAD DATA LOCAL INFILE file request rejected due to restrictions on access.
mysql> SELECT @@local_infile ;
+----------------+
| @@local_infile |
+----------------+
| 1 |
+----------------+
1 row in set (0.00 sec)
mysql> exit
Bye
开启mysql客户端的local_infile,并连接到bikeshare。
[root@tisim bikeshare-data]# mysql -h 192.168.3.171 -P 4000 -uroot bikeshare --local-infile=1
mysql> SELECT DATABASE();
+------------+
| DATABASE() |
+------------+
| bikeshare |
+------------+
1 row in set (0.00 sec)
再次执行LOAD DATA LOCAL的命令。这次成功了。
mysql> SET tidb_dml_batch_size = 20000;
Query OK, 0 rows affected (0.00 sec)
mysql> LOAD DATA LOCAL INFILE '2017Q1-capitalbikeshare-tripdata.csv' INTO TABLE trips
-> FIELDS TERMINATED BY ',' ENCLOSED BY '"'
-> LINES TERMINATED BY '\r\n'
-> IGNORE 1 LINES
-> (duration, start_date, end_date, start_station_number, start_station,
-> end_station_number, end_station, bike_number, member_type);
Query OK, 646510 rows affected (17.02 sec)
Records: 646510 Deleted: 0 Skipped: 0 Warnings: 0
mysql> SELECT COUNT(*) FROM trips;
+----------+
| COUNT(*) |
+----------+
| 646510 |
+----------+
1 row in set (0.21 sec)
一旦执行TRUNCATE操作。
mysql> TRUNCATE TABLE trips;
Query OK, 0 rows affected (0.10 sec)
mysql> SELECT COUNT(*) FROM trips;
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
1 row in set (0.00 sec)
mysql> exit
Bye
加载所有的CSV文件。
[root@tisim bikeshare-data]# for FILE in *.csv; do
> echo "== $FILE =="
> mysql -h 192.168.3.171 -P 4000 -uroot bikeshare --local-infile=1 -e "SET tidb_dml_batch_size = 20000; LOAD DATA LOCAL INFILE '${FILE}' INTO TABLE trips FIELDS TERMINATED BY ',' ENCLOSED BY '\
"' LINES TERMINATED BY '\r\n' IGNORE 1 LINES (duration, start_date, end_date, start_station_number, start_station, end_station_number, end_station, bike_number, member_type);"
> done;
== 2010-capitalbikeshare-tripdata.csv ==
== 2011-capitalbikeshare-tripdata.csv ==
== 2012Q1-capitalbikeshare-tripdata.csv ==
== 2012Q2-capitalbikeshare-tripdata.csv ==
== 2012Q3-capitalbikeshare-tripdata.csv ==
== 2012Q4-capitalbikeshare-tripdata.csv ==
== 2013Q1-capitalbikeshare-tripdata.csv ==
== 2013Q2-capitalbikeshare-tripdata.csv ==
== 2013Q3-capitalbikeshare-tripdata.csv ==
== 2013Q4-capitalbikeshare-tripdata.csv ==
== 2014Q1-capitalbikeshare-tripdata.csv ==
== 2014Q2-capitalbikeshare-tripdata.csv ==
== 2014Q3-capitalbikeshare-tripdata.csv ==
== 2014Q4-capitalbikeshare-tripdata.csv ==
== 2015Q1-capitalbikeshare-tripdata.csv ==
== 2015Q2-capitalbikeshare-tripdata.csv ==
== 2015Q3-capitalbikeshare-tripdata.csv ==
== 2015Q4-capitalbikeshare-tripdata.csv ==
== 2016Q1-capitalbikeshare-tripdata.csv ==
== 2016Q2-capitalbikeshare-tripdata.csv ==
== 2016Q3-capitalbikeshare-tripdata.csv ==
== 2016Q4-capitalbikeshare-tripdata.csv ==
== 2017Q1-capitalbikeshare-tripdata.csv ==
== 2017Q2-capitalbikeshare-tripdata.csv ==
== 2017Q3-capitalbikeshare-tripdata.csv ==
== 2017Q4-capitalbikeshare-tripdata.csv ==
我要确认一下已加载的行数。
mysql> SELECT COUNT(*) FROM trips;
+----------+
| COUNT(*) |
+----------+
| 19117643 |
+----------+
1 row in set (3.05 sec)
我将确认执行计划。
-
- cop[tikv]のtaskがTiKVへオフロードしている処理を表しており、TiKVが、bike_number = ‘W00742’に一致する行を取得して行数を集計していることが確認できます。
- rootのtaskがTiDB側の処理を表しており、TiDBからの結果を集約していることが確認できます。
mysql> EXPLAIN SELECT COUNT(*) FROM trips WHERE bike_number = 'W00742';
+------------------------------+-------------+-----------+---------------+-------------------------------------------+
| id | estRows | task | access object | operator info |
+------------------------------+-------------+-----------+---------------+-------------------------------------------+
| HashAgg_13 | 1.00 | root | | funcs:count(Column#12)->Column#11 |
| └─TableReader_14 | 1.00 | root | | data:HashAgg_6 |
| └─HashAgg_6 | 1.00 | cop[tikv] | | funcs:count(1)->Column#12 |
| └─Selection_12 | 9386.78 | cop[tikv] | | eq(bikeshare.trips.bike_number, "W00742") |
| └─TableFullScan_11 | 19938837.00 | cop[tikv] | table:trips | keep order:false |
+------------------------------+-------------+-----------+---------------+-------------------------------------------+
5 rows in set (0.01 sec)
总结而言
在执行数据加载到mysql服务器上的同时,我可以使用与之相同的方式来执行。在确认加载后的执行计划时,我理解到了TiKV将处理工作卸载的事实。只有经过TiKV端过滤的数据才会被传输到TiDB端,这样可以期望减少TiDB端的CPU资源和数据传输量。在像商业环境中TiDB节点和TiKV节点分别在不同服务器上运行的环境中,可以受益匪浅。