我尝试用Apache Drill编程的体验

在午餐时,我稍微听说了Apache Drill的事情。
听说它可以使用类似SQL的查询将任何RDS、DynamoDB的表和文件系统上,以及S3上的JSON/tsv/csv文件进行连接。

因为看起来有趣,所以我想试一试。

文件

安装

使用brew进行一击

brew install apache-drill

模拟数据创建

使用Mockaroo创建模拟数据。

JSON(列的参数保持默认设置)。

[{"id":1,"gender":"Male","first_name":"Willie","last_name":"Gardner","email":"wgardner0@csmonitor.com","ip_address":"171.75.157.165"},
{"id":2,"gender":"Female","first_name":"Lois","last_name":"Jacobs","email":"ljacobs1@photobucket.com","ip_address":"188.171.106.166"},
{"id":3,"gender":"Male","first_name":"Shawn","last_name":"Rivera","email":"srivera2@storify.com","ip_address":"160.224.138.213"},
{"id":4,"gender":"Female","first_name":"Stephanie","last_name":"Hernandez","email":"shernandez3@shinystat.com","ip_address":"77.67.203.114"},
{"id":5,"gender":"Male","first_name":"Ralph","last_name":"Wright","email":"rwright4@booking.com","ip_address":"89.240.134.189"},
....

请添加CSV中的color和company_name字段。

id,first_name,last_name,email,gender,ip_address,color,company_name
1,Kenneth,Wheeler,kwheeler0@domainmarket.com,Male,54.230.82.29,Khaki,Voonyx
2,Brenda,Young,byoung1@networkadvertising.org,Female,79.121.226.209,Goldenrod,Feedbug
3,Steven,Knight,sknight2@hud.gov,Male,161.79.31.161,Purple,Aimbo
4,Timothy,Daniels,tdaniels3@ovh.net,Male,233.15.252.170,Mauv,Oba
5,Adam,Wright,awright4@cbc.ca,Male,56.247.113.21,Indigo,Kwilith
....

确认动作

首先进入钻孔壳。就钻下去吧!!

$ sqlline -u jdbc:drill:zk=local
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Jan 05, 2016 3:32:08 PM org.glassfish.jersey.server.ApplicationHandler initialize                                                                  INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...                                                                  apache drill 1.4.0
"just drill it"
 0: jdbc:drill:zk=local>

也提供了Web控制台,在进入shell状态下可访问http://127.0.0.1:8047/

Screen Shot 2016-01-05 at 16.06.45.png

让我们来解析一下 JSON 数据

0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.json` AS tbl LIMIT 10;
+-----+---------+-------------+------------+------------------------------------+------------------+
| id  | gender  | first_name  | last_name  |               email                |    ip_address    |
+-----+---------+-------------+------------+------------------------------------+------------------+
| 1   | Male    | Willie      | Gardner    | wgardner0@csmonitor.com            | 171.75.157.165   |
| 2   | Female  | Lois        | Jacobs     | ljacobs1@photobucket.com           | 188.171.106.166  |
| 3   | Male    | Shawn       | Rivera     | srivera2@storify.com               | 160.224.138.213  |
| 4   | Female  | Stephanie   | Hernandez  | shernandez3@shinystat.com          | 77.67.203.114    |
| 5   | Male    | Ralph       | Wright     | rwright4@booking.com               | 89.240.134.189   |
| 6   | Male    | Donald      | Graham     | dgraham5@google.com.br             | 239.6.250.80     |
| 7   | Male    | Justin      | Howell     | jhowell6@dion.ne.jp                | 159.104.8.8      |
| 8   | Male    | Russell     | Mills      | rmills7@nsw.gov.au                 | 245.130.6.110    |
| 9   | Female  | Rose        | Gonzalez   | rgonzalez8@scientificamerican.com  | 16.9.119.25      |
| 10  | Female  | Mildred     | Ramos      | mramos9@unc.edu                    | 82.232.80.156    |
+-----+---------+-------------+------------+------------------------------------+------------------+
10 rows selected (0.215 seconds)

哇哈哈哈。这真不错。而且还挺快的。

让我们也尝试一下csv的方法

0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.csv` AS tbl  LIMIT 10;
+----------------------------------------------------------------------------------------------------------+
|                                                 columns                                                  |
+----------------------------------------------------------------------------------------------------------+
| ["id","first_name","last_name","email","gender","ip_address","color","company_name"]                     |
| ["1","Kenneth","Wheeler","kwheeler0@domainmarket.com","Male","54.230.82.29","Khaki","Voonyx"]            |
| ["2","Brenda","Young","byoung1@networkadvertising.org","Female","79.121.226.209","Goldenrod","Feedbug"]  |
| ["3","Steven","Knight","sknight2@hud.gov","Male","161.79.31.161","Purple","Aimbo"]                       |
| ["4","Timothy","Daniels","tdaniels3@ovh.net","Male","233.15.252.170","Mauv","Oba"]                       |
| ["5","Adam","Wright","awright4@cbc.ca","Male","56.247.113.21","Indigo","Kwilith"]                        |
| ["6","Jane","Wheeler","jwheeler5@unc.edu","Female","42.132.133.182","Teal","Gabcube"]                    |
| ["7","Jack","Smith","jsmith6@domainmarket.com","Male","199.219.253.212","Fuscia","Pixope"]               |
| ["8","James","Hayes","jhayes7@businessinsider.com","Male","237.34.224.202","Purple","Yabox"]             |
| ["9","Susan","Murphy","smurphy8@joomla.org","Female","85.200.49.196","Turquoise","Jaloo"]                |
+----------------------------------------------------------------------------------------------------------+

哦哦。原来CSV是这样的。这样就可以用JOIN了吗?可能有些选项吧。

我打算查看文档,但只是粗略地看了一下发现有这个内容。

只要定义这个范围,就可以了。

  "skipFirstLine": false,
  "extractHeader": true,

听说不是在cfg文件中设置,而是要在Web控制台或使用curl进行设置。
这次我们决定在Web控制台上修改现有csv的定义。

点击菜单选项卡中的存储,然后点击dfs的“更新”按钮。

Screen_Shot_2016-01-05_at_16_16_36.png
Screen_Shot_2016-01-05_at_16_27_39.png
0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.csv` AS tbl  LIMIT 10;
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+
| id  | first_name  | last_name  |              email              | gender  |    ip_address    |   color    | company_name  |
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+
| 1   | Kenneth     | Wheeler    | kwheeler0@domainmarket.com      | Male    | 54.230.82.29     | Khaki      | Voonyx        |
| 2   | Brenda      | Young      | byoung1@networkadvertising.org  | Female  | 79.121.226.209   | Goldenrod  | Feedbug       |
| 3   | Steven      | Knight     | sknight2@hud.gov                | Male    | 161.79.31.161    | Purple     | Aimbo         |
| 4   | Timothy     | Daniels    | tdaniels3@ovh.net               | Male    | 233.15.252.170   | Mauv       | Oba           |
| 5   | Adam        | Wright     | awright4@cbc.ca                 | Male    | 56.247.113.21    | Indigo     | Kwilith       |
| 6   | Jane        | Wheeler    | jwheeler5@unc.edu               | Female  | 42.132.133.182   | Teal       | Gabcube       |
| 7   | Jack        | Smith      | jsmith6@domainmarket.com        | Male    | 199.219.253.212  | Fuscia     | Pixope        |
| 8   | James       | Hayes      | jhayes7@businessinsider.com     | Male    | 237.34.224.202   | Purple     | Yabox         |
| 9   | Susan       | Murphy     | smurphy8@joomla.org             | Female  | 85.200.49.196    | Turquoise  | Jaloo         |
| 10  | Ann         | Hill       | ahill9@youku.com                | Female  | 254.251.33.227   | Red        | Yadel         |
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+

太好了!

大家加入大正義吧!

如果你能做到这个,我感到非常兴奋。

0: jdbc:drill:zk=local> SELECT tbl1.first_name, tbl1.last_name, tbl2.color, tbl2.company_name FROM dfs.`/tmp/work/MOCK_DATA.json` AS tbl1
. . . . . . . . . . . > JOIN
. . . . . . . . . . . > dfs.`/tmp/work/MOCK_DATA.csv` AS tbl2 ON tbl1.first_name=tbl2.first_name and tbl1.last_name=tbl2.last_name;
+-------------+-------------+-------------+----------------+
| first_name  |  last_name  |    color    |  company_name  |
+-------------+-------------+-------------+----------------+
| Scott       | Ray         | Maroon      | Twinte         |
| Andrew      | Brown       | Orange      | Photolist      |
| Rose        | Washington  | Orange      | Livetube       |
| Steven      | Adams       | Maroon      | Zoombeat       |
| Nancy       | Porter      | Indigo      | Kwinu          |
| Lois        | Williams    | Purple      | Flashset       |
| Justin      | Hart        | Red         | Miboo          |
| Arthur      | Scott       | Indigo      | Thoughtbeat    |
| Wanda       | Peters      | Khaki       | Teklist        |
| Christine   | Dean        | Aquamarine  | Buzzshare      |
| Paula       | Nelson      | Green       | Kare           |
| Betty       | Wagner      | Aquamarine  | Photobean      |
| Jerry       | Ross        | Mauv        | Linktype       |
| Harry       | Washington  | Green       | Reallinks      |
| Nancy       | Lynch       | Turquoise   | Twitternation  |
| Heather     | Moore       | Violet      | Buzzbean       |
| Linda       | Gray        | Violet      | Fadeo          |
+-------------+-------------+-------------+----------------+
17 rows selected (1.026 seconds)

虽然这个查询没有太多意义,但是能成功地进行了连接。太好了。

如果有可用于工作的案例,我将尝试在大规模数据和Spark上进行验证和比较。

广告
将在 10 秒后关闭
bannerAds