我尝试用Apache Drill编程的体验
在午餐时,我稍微听说了Apache Drill的事情。
听说它可以使用类似SQL的查询将任何RDS、DynamoDB的表和文件系统上,以及S3上的JSON/tsv/csv文件进行连接。
因为看起来有趣,所以我想试一试。
文件
安装
使用brew进行一击
brew install apache-drill
模拟数据创建
使用Mockaroo创建模拟数据。
JSON(列的参数保持默认设置)。
[{"id":1,"gender":"Male","first_name":"Willie","last_name":"Gardner","email":"wgardner0@csmonitor.com","ip_address":"171.75.157.165"},
{"id":2,"gender":"Female","first_name":"Lois","last_name":"Jacobs","email":"ljacobs1@photobucket.com","ip_address":"188.171.106.166"},
{"id":3,"gender":"Male","first_name":"Shawn","last_name":"Rivera","email":"srivera2@storify.com","ip_address":"160.224.138.213"},
{"id":4,"gender":"Female","first_name":"Stephanie","last_name":"Hernandez","email":"shernandez3@shinystat.com","ip_address":"77.67.203.114"},
{"id":5,"gender":"Male","first_name":"Ralph","last_name":"Wright","email":"rwright4@booking.com","ip_address":"89.240.134.189"},
....
请添加CSV中的color和company_name字段。
id,first_name,last_name,email,gender,ip_address,color,company_name
1,Kenneth,Wheeler,kwheeler0@domainmarket.com,Male,54.230.82.29,Khaki,Voonyx
2,Brenda,Young,byoung1@networkadvertising.org,Female,79.121.226.209,Goldenrod,Feedbug
3,Steven,Knight,sknight2@hud.gov,Male,161.79.31.161,Purple,Aimbo
4,Timothy,Daniels,tdaniels3@ovh.net,Male,233.15.252.170,Mauv,Oba
5,Adam,Wright,awright4@cbc.ca,Male,56.247.113.21,Indigo,Kwilith
....
确认动作
首先进入钻孔壳。就钻下去吧!!
$ sqlline -u jdbc:drill:zk=local
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Jan 05, 2016 3:32:08 PM org.glassfish.jersey.server.ApplicationHandler initialize INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26... apache drill 1.4.0
"just drill it"
0: jdbc:drill:zk=local>
也提供了Web控制台,在进入shell状态下可访问http://127.0.0.1:8047/
让我们来解析一下 JSON 数据
0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.json` AS tbl LIMIT 10;
+-----+---------+-------------+------------+------------------------------------+------------------+
| id | gender | first_name | last_name | email | ip_address |
+-----+---------+-------------+------------+------------------------------------+------------------+
| 1 | Male | Willie | Gardner | wgardner0@csmonitor.com | 171.75.157.165 |
| 2 | Female | Lois | Jacobs | ljacobs1@photobucket.com | 188.171.106.166 |
| 3 | Male | Shawn | Rivera | srivera2@storify.com | 160.224.138.213 |
| 4 | Female | Stephanie | Hernandez | shernandez3@shinystat.com | 77.67.203.114 |
| 5 | Male | Ralph | Wright | rwright4@booking.com | 89.240.134.189 |
| 6 | Male | Donald | Graham | dgraham5@google.com.br | 239.6.250.80 |
| 7 | Male | Justin | Howell | jhowell6@dion.ne.jp | 159.104.8.8 |
| 8 | Male | Russell | Mills | rmills7@nsw.gov.au | 245.130.6.110 |
| 9 | Female | Rose | Gonzalez | rgonzalez8@scientificamerican.com | 16.9.119.25 |
| 10 | Female | Mildred | Ramos | mramos9@unc.edu | 82.232.80.156 |
+-----+---------+-------------+------------+------------------------------------+------------------+
10 rows selected (0.215 seconds)
哇哈哈哈。这真不错。而且还挺快的。
让我们也尝试一下csv的方法
0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.csv` AS tbl LIMIT 10;
+----------------------------------------------------------------------------------------------------------+
| columns |
+----------------------------------------------------------------------------------------------------------+
| ["id","first_name","last_name","email","gender","ip_address","color","company_name"] |
| ["1","Kenneth","Wheeler","kwheeler0@domainmarket.com","Male","54.230.82.29","Khaki","Voonyx"] |
| ["2","Brenda","Young","byoung1@networkadvertising.org","Female","79.121.226.209","Goldenrod","Feedbug"] |
| ["3","Steven","Knight","sknight2@hud.gov","Male","161.79.31.161","Purple","Aimbo"] |
| ["4","Timothy","Daniels","tdaniels3@ovh.net","Male","233.15.252.170","Mauv","Oba"] |
| ["5","Adam","Wright","awright4@cbc.ca","Male","56.247.113.21","Indigo","Kwilith"] |
| ["6","Jane","Wheeler","jwheeler5@unc.edu","Female","42.132.133.182","Teal","Gabcube"] |
| ["7","Jack","Smith","jsmith6@domainmarket.com","Male","199.219.253.212","Fuscia","Pixope"] |
| ["8","James","Hayes","jhayes7@businessinsider.com","Male","237.34.224.202","Purple","Yabox"] |
| ["9","Susan","Murphy","smurphy8@joomla.org","Female","85.200.49.196","Turquoise","Jaloo"] |
+----------------------------------------------------------------------------------------------------------+
哦哦。原来CSV是这样的。这样就可以用JOIN了吗?可能有些选项吧。
我打算查看文档,但只是粗略地看了一下发现有这个内容。
只要定义这个范围,就可以了。
"skipFirstLine": false,
"extractHeader": true,
听说不是在cfg文件中设置,而是要在Web控制台或使用curl进行设置。
这次我们决定在Web控制台上修改现有csv的定义。
点击菜单选项卡中的存储,然后点击dfs的“更新”按钮。
0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.csv` AS tbl LIMIT 10;
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+
| id | first_name | last_name | email | gender | ip_address | color | company_name |
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+
| 1 | Kenneth | Wheeler | kwheeler0@domainmarket.com | Male | 54.230.82.29 | Khaki | Voonyx |
| 2 | Brenda | Young | byoung1@networkadvertising.org | Female | 79.121.226.209 | Goldenrod | Feedbug |
| 3 | Steven | Knight | sknight2@hud.gov | Male | 161.79.31.161 | Purple | Aimbo |
| 4 | Timothy | Daniels | tdaniels3@ovh.net | Male | 233.15.252.170 | Mauv | Oba |
| 5 | Adam | Wright | awright4@cbc.ca | Male | 56.247.113.21 | Indigo | Kwilith |
| 6 | Jane | Wheeler | jwheeler5@unc.edu | Female | 42.132.133.182 | Teal | Gabcube |
| 7 | Jack | Smith | jsmith6@domainmarket.com | Male | 199.219.253.212 | Fuscia | Pixope |
| 8 | James | Hayes | jhayes7@businessinsider.com | Male | 237.34.224.202 | Purple | Yabox |
| 9 | Susan | Murphy | smurphy8@joomla.org | Female | 85.200.49.196 | Turquoise | Jaloo |
| 10 | Ann | Hill | ahill9@youku.com | Female | 254.251.33.227 | Red | Yadel |
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+
太好了!
大家加入大正義吧!
如果你能做到这个,我感到非常兴奋。
0: jdbc:drill:zk=local> SELECT tbl1.first_name, tbl1.last_name, tbl2.color, tbl2.company_name FROM dfs.`/tmp/work/MOCK_DATA.json` AS tbl1
. . . . . . . . . . . > JOIN
. . . . . . . . . . . > dfs.`/tmp/work/MOCK_DATA.csv` AS tbl2 ON tbl1.first_name=tbl2.first_name and tbl1.last_name=tbl2.last_name;
+-------------+-------------+-------------+----------------+
| first_name | last_name | color | company_name |
+-------------+-------------+-------------+----------------+
| Scott | Ray | Maroon | Twinte |
| Andrew | Brown | Orange | Photolist |
| Rose | Washington | Orange | Livetube |
| Steven | Adams | Maroon | Zoombeat |
| Nancy | Porter | Indigo | Kwinu |
| Lois | Williams | Purple | Flashset |
| Justin | Hart | Red | Miboo |
| Arthur | Scott | Indigo | Thoughtbeat |
| Wanda | Peters | Khaki | Teklist |
| Christine | Dean | Aquamarine | Buzzshare |
| Paula | Nelson | Green | Kare |
| Betty | Wagner | Aquamarine | Photobean |
| Jerry | Ross | Mauv | Linktype |
| Harry | Washington | Green | Reallinks |
| Nancy | Lynch | Turquoise | Twitternation |
| Heather | Moore | Violet | Buzzbean |
| Linda | Gray | Violet | Fadeo |
+-------------+-------------+-------------+----------------+
17 rows selected (1.026 seconds)
虽然这个查询没有太多意义,但是能成功地进行了连接。太好了。
如果有可用于工作的案例,我将尝试在大规模数据和Spark上进行验证和比较。