{"id":60388,"date":"2016-09-27T19:00:25","date_gmt":"2016-09-27T16:00:25","guid":{"rendered":"https:\/\/www.javacodegeeks.com\/?p=60388"},"modified":"2016-09-25T23:22:00","modified_gmt":"2016-09-25T20:22:00","slug":"guide-getting-started-apache-drill","status":"publish","type":"post","link":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html","title":{"rendered":"How to Guide: Getting Started with Apache Drill"},"content":{"rendered":"<p><a href=\"https:\/\/drill.apache.org\/\" target=\"_blank\">Apache Drill<\/a> is an engine that can connect to many different data sources, and provide a SQL interface to them. It&#8217;s not just a wanna-be SQL interface that trips over at anything complex &#8211; it&#8217;s a <a href=\"https:\/\/drill.apache.org\/docs\/sql-reference-introduction\/\" target=\"_blank\">hugely functional one<\/a> including support for many built in functions as well as windowing functions. Whilst it can connect to standard data sources that you&#8217;d be able to query with SQL anyway, like Oracle or MySQL, it can also work with flat files such as CSV or JSON, as well as Avro and Parquet formats. It&#8217;s this capability to run SQL against files that first piqued my interest in Apache Drill. I&#8217;ve been spending a lot of time looking at Big Data architectures and tools, including <a href=\"https:\/\/speakerdeck.com\/rmoff\/unlock-the-value-in-your-big-data-reservoir-using-oracle-big-data-discovery-and-oracle-big-data-spatial-and-graph\" target=\"_blank\">Big Data Discovery<\/a>. As part of this, and experimenting with <a href=\"http:\/\/www.rittmanmead.com\/2016\/07\/stream-analytics-processing-kafka-oracle-stream-analytics\/\" target=\"_blank\">data pipeline options<\/a> one of the gaps that I&#8217;ve found is the functionality to dig through files in their raw state, before they&#8217;ve been brought into something like Hive which would enable their exploration through BDD and other tools.<\/p>\n<p>In this article I&#8217;ll walk through getting started with Apache Drill, and show some of the types of queries that I think are a great example of how useful it can be.<\/p>\n<h2>Getting Started<\/h2>\n<p>It&#8217;s very simple to get going with Apache Drill &#8211; just <a href=\"https:\/\/drill.apache.org\/download\/\" target=\"_blank\">download<\/a> and unpack it, and run. Whilst it can run distributed across machines for performance, it can also run standalone on a laptop.<\/p>\n<p>To launch it<\/p>\n<pre class=\"brush:java\">cd \/opt\/apache-drill-1.7.0\/\r\nbin\/sqlline -u jdbc:drill:zk=local<\/pre>\n<p>If you get <code>No current connection<\/code> or <code>com.fasterxml.jackson.databind.JavaType.isReferenceType()Z<\/code> then you have a <a href=\"http:\/\/rmoff.net\/2016\/06\/20\/apache-drill-not-connected\/\" target=\"_blank\">conflicting JAR problem<\/a> (e.g. I encountered this on Oracle&#8217;s BigDataLite VM), and should launch it with a clean environment<\/p>\n<pre class=\"brush:java\">env -i HOME=\"$HOME\" LC_CTYPE=\"${LC_ALL:-${LC_CTYPE:-$LANG}}\" PATH=\"$PATH\" USER=\"$USER\" \/opt\/apache-drill-1.7.0\/bin\/drill-embedded<\/pre>\n<p>There&#8217;s a built in dataset that you can use for testing:<\/p>\n<pre class=\"brush:java\">USE cp;\r\nSELECT employee_id, first_name FROM `employee.json` limit 5;<\/pre>\n<p>This should return five rows, in a very familiar environment if you&#8217;re used to using SQL*Plus and similar tools:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; USE cp;\r\n+-------+---------------------------------+\r\n|  ok   |             summary             |\r\n+-------+---------------------------------+\r\n| true  | Default schema changed to [cp]  |\r\n+-------+---------------------------------+\r\n1 row selected (1.776 seconds)\r\n0: jdbc:drill:zk=local&gt;     SELECT employee_id, first_name FROM `employee.json` limit 5;\r\n+--------------+-------------+\r\n| employee_id  | first_name  |\r\n+--------------+-------------+\r\n| 1            | Sheri       |\r\n| 2            | Derrick     |\r\n| 4            | Michael     |\r\n| 5            | Maya        |\r\n| 6            | Roberta     |\r\n+--------------+-------------+\r\n5 rows selected (3.624 seconds)<\/pre>\n<p>So far, so SQL, so relational &#8211; so familiar, really. Where Apache Drill starts to deviate from the obvious is its use of storage handlers. In the above query cp is the &#8216;database&#8217; that we&#8217;re running our query against, but this is in fact a &#8220;classpath&#8221; (hence &#8220;cp&#8221;) storage handler that&#8217;s defined by default. Within a &#8216;database&#8217; there are &#8216;schemas&#8217; which are sub-configurations of the storage handler. We&#8217;ll have a look at viewing and defining these later on. For now, it&#8217;s useful to know that you can also list out the available databases:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; show databases;\r\n+---------------------+\r\n|     SCHEMA_NAME     |\r\n+---------------------+\r\n| INFORMATION_SCHEMA  |\r\n| cp.default          |\r\n| dfs.default         |\r\n| dfs.root            |\r\n| dfs.tmp             |\r\n| sys                 |\r\n+---------------------+<\/pre>\n<p>Note <code>databases<\/code> command is a synonym for <code>schemas<\/code>; it&#8217;s the <code>&lt;database&gt;.&lt;schema&gt;<\/code> that&#8217;s returned for both. In Apache Drill the backtick is used to enclose identifiers (such as schema names, column names, and so on), and it&#8217;s quite particular about it. For example, this is valid:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; USE `cp.default`;\r\n+-------+-----------------------------------------+\r\n|  ok   |                 summary                 |\r\n+-------+-----------------------------------------+\r\n| true  | Default schema changed to [cp.default]  |\r\n+-------+-----------------------------------------+\r\n1 row selected (0.171 seconds)<\/pre>\n<p>whilst this isn&#8217;t:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; USE cp.default;\r\nError: PARSE ERROR: Encountered \". default\" at line 1, column 7.\r\nWas expecting one of:\r\n&lt;EOF&gt;\r\n\".\" &lt;IDENTIFIER&gt; ...\r\n\".\" &lt;QUOTED_IDENTIFIER&gt; ...\r\n\".\" &lt;BACK_QUOTED_IDENTIFIER&gt; ...\r\n\".\" &lt;BRACKET_QUOTED_IDENTIFIER&gt; ...\r\n\".\" &lt;UNICODE_QUOTED_IDENTIFIER&gt; ...\r\n\".\" \"*\" ...\r\n\r\nSQL Query USE cp.default<\/pre>\n<p>This is because <code>default<\/code> is a reserved word, and hence must be quoted. Hence, you can also use<\/p>\n<p><code>0: jdbc:drill:zk=local&gt; use cp.`default`;<\/code><\/p>\n<p>but not<\/p>\n<p><code>0: jdbc:drill:zk=local&gt; use `cp`.default;<\/code><\/p>\n<h2>Querying JSON data<\/h2>\n<p>On the Apache Drill website there&#8217;s some useful <a href=\"https:\/\/drill.apache.org\/docs\/tutorials\/\" target=\"_blank\">tutorials<\/a>, including one using data <a href=\"http:\/\/www.yelp.com\/dataset_challenge\" target=\"_blank\">provided by Yelp<\/a>. This was the dataset that originally got me looking at Drill, since I was using it as an input to <a href=\"https:\/\/speakerdeck.com\/rmoff\/unlock-the-value-in-your-big-data-reservoir-using-oracle-big-data-discovery-and-oracle-big-data-spatial-and-graph\" target=\"_blank\">Big Data Discovery<\/a> (BDD) but struggling on two counts. First up was how best to define a suitable Hive table over it in order to ingest it to BDD. Following from this was trying to understand what value there might be in the data which would drive how long to spend perfecting the way in which I exposed the data in Hive. The examples below show the kind of complications that complex JSON can introduce when queried in a tabular fashion.<\/p>\n<p>First up, querying a JSON file, with the schema inferred automagically. Pretty cool.<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select * from `\/user\/oracle\/incoming\/yelp\/tip_json\/yelp_academic_dataset_tip.json` limit 5;\r\n+---------+------+-------------+-------+------+------+\r\n| user_id | text | business_id | likes | date | type |\r\n+---------+------+-------------+-------+------+------+\r\n| -6rEfobYjMxpUWLNxszaxQ | Don't waste your time. | cE27W9VPgO88Qxe4ol6y_g | 0 | 2013-04-18 | tip |\r\n| EZ0r9dKKtEGVx2CdnowPCw | Your GPS will not allow you to find this place. Put Rankin police department in instead. They are directly across the street. | mVHrayjG3uZ_RLHkLj-AMg | 1 | 2013-01-06 | tip |\r\n| xb6zEQCw9I-Gl0g06e1KsQ | Great drink specials! | KayYbHCt-RkbGcPdGOThNg | 0 | 2013-12-03 | tip |\r\n| QawZN4PSW7ng_9SP7pjsVQ | Friendly staff, good food, great beer selection, and relaxing atmosphere | KayYbHCt-RkbGcPdGOThNg | 0 | 2015-07-08 | tip |\r\n| MLQre1nvUtW-RqMTc4iC9A | Beautiful restoration. | 1_lU0-eSWJCRvNGk78Zh9Q | 0 | 2015-10-25 | tip |\r\n+---------+------+-------------+-------+------+------+\r\n5 rows selected (2.341 seconds)<\/pre>\n<p>We can use standard SQL aggregations such as <code>COUNT<\/code>:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select count(*) from `\/user\/oracle\/incoming\/yelp\/tip_json\/yelp_academic_dataset_tip.json`;\r\n+---------+\r\n| EXPR$0  |\r\n+---------+\r\n| 591864  |\r\n+---------+\r\n1 row selected (4.495 seconds)<\/pre>\n<p>as well as <code>GROUP BY<\/code> operation:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select `date`,count(*) as tip_count from `\/user\/oracle\/incoming\/yelp\/tip_json\/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;\r\n+-------------+------------+\r\n|    date     | tip_count  |\r\n+-------------+------------+\r\n| 2012-07-21  | 719        |\r\n| 2012-05-19  | 718        |\r\n| 2012-08-04  | 699        |\r\n| 2012-06-23  | 690        |\r\n| 2012-07-28  | 682        |\r\n+-------------+------------+\r\n5 rows selected (7.111 seconds)<\/pre>\n<p>Digging into the data a bit, we can see that it&#8217;s not entirely flat &#8211; note, for example, the <code>hours<\/code> column, which is a nested JSON object:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select full_address,city,hours from `\/user\/oracle\/incoming\/yelp\/business_json` b limit 5;\r\n+--------------+------+-------+\r\n| full_address | city | hours |\r\n+--------------+------+-------+\r\n| 4734 Lebanon Church Rd\r\nDravosburg, PA 15034 | Dravosburg | {\"Friday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Tuesday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Thursday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Wednesday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Monday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Sunday\":{},\"Saturday\":{}} |\r\n| 202 McClure St\r\nDravosburg, PA 15034 | Dravosburg | {\"Friday\":{},\"Tuesday\":{},\"Thursday\":{},\"Wednesday\":{},\"Monday\":{},\"Sunday\":{},\"Saturday\":{}} |\r\n| 1 Ravine St\r\nDravosburg, PA 15034 | Dravosburg | {\"Friday\":{},\"Tuesday\":{},\"Thursday\":{},\"Wednesday\":{},\"Monday\":{},\"Sunday\":{},\"Saturday\":{}} |\r\n| 1530 Hamilton Rd\r\nBethel Park, PA 15234 | Bethel Park | {\"Friday\":{},\"Tuesday\":{},\"Thursday\":{},\"Wednesday\":{},\"Monday\":{},\"Sunday\":{},\"Saturday\":{}} |\r\n| 301 South Hills Village\r\nPittsburgh, PA 15241 | Pittsburgh | {\"Friday\":{\"close\":\"17:00\",\"open\":\"10:00\"},\"Tuesday\":{\"close\":\"21:00\",\"open\":\"10:00\"},\"Thursday\":{\"close\":\"17:00\",\"open\":\"10:00\"},\"Wednesday\":{\"close\":\"21:00\",\"open\":\"10:00\"},\"Monday\":{\"close\":\"21:00\",\"open\":\"10:00\"},\"Sunday\":{\"close\":\"18:00\",\"open\":\"11:00\"},\"Saturday\":{\"close\":\"21:00\",\"open\":\"10:00\"}} |\r\n+--------------+------+-------+\r\n5 rows selected (0.721 seconds)\r\n0: jdbc:drill:zk=local&gt;<\/pre>\n<p>With Apache Drill we can simply use dot notation to access nested values. It&#8217;s necessary to alias the table (<code>b<\/code> in this example) when you&#8217;re doing this:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select b.hours from `\/user\/oracle\/incoming\/yelp\/business_json` b limit 1;\r\n+-------+\r\n| hours |\r\n+-------+\r\n| {\"Friday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Tuesday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Thursday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Wednesday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Monday\":{\"close\":\"21:00\",\"open\":\"11:00\"},\"Sunday\":{},\"Saturday\":{}} |\r\n+-------+<\/pre>\n<p>Nested objects can themselves be nested &#8211; not a problem with Apache Drill, we just chain the dot notation further:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select b.hours.Friday from `\/user\/oracle\/incoming\/yelp\/business_json` b limit 1;\r\n+-----------------------------------+\r\n|              EXPR$0               |\r\n+-----------------------------------+\r\n| {\"close\":\"21:00\",\"open\":\"11:00\"}  |\r\n+-----------------------------------+\r\n1 row selected (0.238 seconds)<\/pre>\n<p>Note the use of backtick (<code>`<\/code>) to quote the reserved <code>open<\/code> and <code>close<\/code> keywords:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select b.hours.Friday.`open`,b.hours.Friday.`close` from `\/user\/oracle\/incoming\/yelp\/business_json` b limit 1;\r\n+---------+---------+\r\n| EXPR$0  | EXPR$1  |\r\n+---------+---------+\r\n| 11:00   | 21:00   |\r\n+---------+---------+\r\n1 row selected (0.58 seconds)<\/pre>\n<p>Nested columns are proper objects in their own right in the query, and can be used as predicates too:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select b.name,b.full_address,b.hours.Friday.`open` from `\/user\/oracle\/incoming\/yelp\/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;\r\n+------------------------+------------------------------------------------+---------+\r\n|          name          |                  full_address                  | EXPR$2  |\r\n+------------------------+------------------------------------------------+---------+\r\n| Mr Hoagie              | 4734 Lebanon Church Rd\r\nDravosburg, PA 15034    | 11:00   |\r\n| Alexion's Bar &amp; Grill  | 141 Hawthorne St\r\nGreentree\r\nCarnegie, PA 15106  | 11:00   |\r\n| Rocky's Lounge         | 1201 Washington Ave\r\nCarnegie, PA 15106         | 11:00   |\r\n| Papa J's               | 200 E Main St\r\nCarnegie\r\nCarnegie, PA 15106      | 11:00   |\r\n| Italian Village Pizza  | 2615 Main St\r\nHomestead, PA 15120               | 11:00   |\r\n+------------------------+------------------------------------------------+---------+\r\n5 rows selected (0.404 seconds)<\/pre>\n<p>You&#8217;ll notice in the above output that the <code>full_address<\/code> field has line breaks in &#8212; we can just use a <a href=\"https:\/\/drill.apache.org\/docs\/about-sql-function-examples\/\" target=\"_blank\">SQL Function<\/a> to replace line breaks with commas:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select b.name,regexp_replace(b.full_address,'\\n',','),b.hours.Friday.`open` from `\/user\/oracle\/incoming\/yelp\/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;\r\n+------------------------+------------------------------------------------+---------+\r\n|          name          |                     EXPR$1                     | EXPR$2  |\r\n+------------------------+------------------------------------------------+---------+\r\n| Mr Hoagie              | 4734 Lebanon Church Rd,Dravosburg, PA 15034    | 11:00   |\r\n| Alexion's Bar &amp; Grill  | 141 Hawthorne St,Greentree,Carnegie, PA 15106  | 11:00   |\r\n| Rocky's Lounge         | 1201 Washington Ave,Carnegie, PA 15106         | 11:00   |\r\n| Papa J's               | 200 E Main St,Carnegie,Carnegie, PA 15106      | 11:00   |\r\n| Italian Village Pizza  | 2615 Main St,Homestead, PA 15120               | 11:00   |\r\n+------------------------+------------------------------------------------+---------+\r\n5 rows selected (1.346 seconds)<\/pre>\n<h2>Query Federation<\/h2>\n<p>So Apache Drill enables you to run SQL queries against data in a multitude of formats and locations, which is rather useful in itself. But even better than that, it lets you federate these sources in a single query. Here&#8217;s an example of joining between data in HDFS and Oracle:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select X.text,\r\n. . . . . . . . . . . &gt; Y.NAME\r\n. . . . . . . . . . . &gt; from hdfs.`\/user\/oracle\/incoming\/yelp\/tip_json\/yelp_academic_dataset_tip.json` X\r\n. . . . . . . . . . . &gt; inner join ora.MOVIEDEMO.YELP_BUSINESS Y\r\n. . . . . . . . . . . &gt; on X.business_id = Y.BUSINESS_ID\r\n. . . . . . . . . . . &gt; where Y.NAME = 'Chick-fil-A'\r\n. . . . . . . . . . . &gt; limit 5;\r\n+--------------------------------------------------------------------+--------------+\r\n|                                text                                |     NAME     |\r\n+--------------------------------------------------------------------+--------------+\r\n| It's daddy daughter date night here and they go ALL OUT!           | Chick-fil-A  |\r\n| Chicken minis!  The best part of waking up Saturday mornings.  :)  | Chick-fil-A  |\r\n| Nice folks as always unlike those ghetto joints                    | Chick-fil-A  |\r\n| Great clean and delicious chicken sandwiches!                      | Chick-fil-A  |\r\n| Spicy Chicken with lettuce, tomato, and pepperjack cheese FTW!     | Chick-fil-A  |\r\n+--------------------------------------------------------------------+--------------+\r\n5 rows selected (3.234 seconds)<\/pre>\n<p>You can define a view over this:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; create or replace view dfs.tmp.yelp_tips as select X.text as tip_text, Y.NAME as business_name from hdfs.`\/user\/oracle\/incoming\/yelp\/tip_json\/yelp_academic_dataset_tip.json` X inner join ora.MOVIEDEMO.YELP_BUSINESS Y on X.business_id = Y.BUSINESS_ID ;\r\n+-------+-------------------------------------------------------------+\r\n|  ok   |                           summary                           |\r\n+-------+-------------------------------------------------------------+\r\n| true  | View 'yelp_tips' replaced successfully in 'dfs.tmp' schema  |\r\n+-------+-------------------------------------------------------------+\r\n1 row selected (0.574 seconds)\r\n0: jdbc:drill:zk=local&gt; describe dfs.tmp.yelp_tips;\r\n+----------------+--------------------+--------------+\r\n|  COLUMN_NAME   |     DATA_TYPE      | IS_NULLABLE  |\r\n+----------------+--------------------+--------------+\r\n| tip_text       | ANY                | YES          |\r\n| business_name  | CHARACTER VARYING  | YES          |\r\n+----------------+--------------------+--------------+\r\n2 rows selected (0.756 seconds)<\/pre>\n<p>and then query it as any regular object:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select tip_text,business_name from dfs.tmp.yelp_tips where business_name like '%Grill' limit 5;\r\n+------+------+\r\n| text | NAME |\r\n+------+------+\r\n| Great drink specials! | Alexion's Bar &amp; Grill |\r\n| Friendly staff, good food, great beer selection, and relaxing atmosphere | Alexion's Bar &amp; Grill |\r\n| Pretty quiet here... | Uno Pizzeria &amp; Grill |\r\n| I recommend this location for quick lunches. 10 min or less lunch menu. Soup bar ( all you can eat)  the broccoli cheddar soup is delicious. | Uno Pizzeria &amp; Grill |\r\n| Instead of pizza, come here for dessert. The deep dish sundae is really good. | Uno Pizzeria &amp; Grill |\r\n+------+------+\r\n5 rows selected (3.272 seconds)<\/pre>\n<h2>Querying Twitter JSON data<\/h2>\n<p>Here&#8217;s an example of using Drill to query a local file holding some Twitter data. You can <a href=\"https:\/\/gist.github.com\/b81b40df61d3faa1f4bf05193cfeec6d\" target=\"_blank\">download the file here<\/a> if you want to try querying it yourself.<\/p>\n<p>To start with I switched to using the <code>dfs<\/code> storage plugin:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; use dfs;\r\n+-------+----------------------------------+\r\n|  ok   |             summary              |\r\n+-------+----------------------------------+\r\n| true  | Default schema changed to [dfs]  |\r\n+-------+----------------------------------+<\/pre>\n<p>And then tried a select against the file. Note the <code>limit 5<\/code> clause &#8211; very useful when you&#8217;re just examining the structure of a file.<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select * from `\/user\/oracle\/incoming\/twitter\/geo_tweets.json` limit 5;\r\nError: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input within\/between OBJECT entries\r\n\r\nFile  \/user\/oracle\/incoming\/twitter\/geo_tweets.json\r\nRecord  2819\r\nColumn  3503\r\nFragment 0:0<\/pre>\n<p>An error? That&#8217;s not supposed to happen. I&#8217;ve got a JSON file, right? It turns out the JSON file is one complete JSON object per line. Except that it&#8217;s not on the last record. Note the record count given in the error above &#8211; 2819:<\/p>\n<pre class=\"brush:java\">[oracle@bigdatalite ~]$ wc -l geo_tweets.json\r\n2818 geo_tweets.json<\/pre>\n<p>So the file only has 2818 complete lines. Hmmm. Let&#8217;s take a look at that record, using a head\/tail bash combo :<\/p>\n<pre class=\"brush:java; wrap-lines:false\">[oracle@bigdatalite ~]$ head -n 2819 geo_tweets.json |tail -n1\r\n{\"created_at\":\"Sun Jul 24 21:00:44 +0000 2016\",\"id\":757319630432067584,\"id_str\":\"757319630432067584\",\"text\":\"And now @HillaryClinton hires @DWStweets: Honorary Campaign Manager across the USA #corruption #hillarysamerica  https:\/\/t.co\/8jAGUu6w2f\",\"source\":\"TweetCaster for iOS\",\"truncated\":false,\"in_reply_to_status_id\":null,\"in_reply_to_status_id_str\":null,\"in_reply_to_user_id\":null,\"in_reply_to_user_id_str\":null,\"in_reply_to_screen_name\":null,\"user\":{\"id\":2170786369,\"id_str\":\"2170786369\",\"name\":\"Patricia Weber\",\"screen_name\":\"InnieBabyBoomer\",\"location\":\"Williamsburg, VA\",\"url\":\"http:\/\/lovesrantsandraves.blogspot.com\/\",\"description\":\"Baby Boomer, Swing Voter, Conservative, Spiritual, #Introvert, Wife, Grandma, Italian, \u2665 Books, Cars, Ferrari, F1 Race\u2665  #tcot\",\"protected\":false,\"verified\":false,\"followers_count\":861,\"friends_count\":918,\"listed_count\":22,\"favourites_count\":17,\"statuses_count\":2363,\"created_at\":\"Sat Nov 02 19:13:06 +0000 2013\",\"utc_offset\":null,\"time_zone\":null,\"geo_enabled\":true,\"lang\":\"en\",\"contributors_enabled\":false,\"is_translator\":false,\"profile_background_color\":\"C0DEED\",\"profile_background_image_url\":\"http:\/\/pbs.twimg.com\/profile_background_images\/378800000107659131\/3589f<\/pre>\n<p>That&#8217;s the complete data in the file &#8211; so Drill is right &#8211; the JSON is corrupted. If we drop that last record and create a new file (<code>geo_tweets.fixed.json<\/code>)<\/p>\n<p>and query it again, we get something!<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt;  select text from `\/users\/rmoff\/data\/geo_tweets.fixed.json` limit 5;\r\n+------+\r\n| text |\r\n+------+\r\n| Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https:\/\/t.co\/joI9GMfRim |\r\n| We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |\r\n| Donald Trump accepted the Republican nomination last night. Isis claimed responsibility. |\r\n| Obama: \"We must stand together and stop terrorism\"\r\nTrump: \"We don't want these people in our country\"\r\n\ufffd |\r\n| Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @\u2026 https:\/\/t.co\/qHWuJXnzbw |\r\n+------+\r\n5 rows selected (0.246 seconds)<\/pre>\n<p><code>text<\/code> here being one of the json fields. I could do a <code>select *<\/code> but it&#8217;s not so intelligable:<div style=\"display:inline-block; margin: 15px 0;\"> <div id=\"adngin-JavaCodeGeeks_incontent_video-0\" style=\"display:inline-block;\"><\/div> <\/div><\/p>\n<pre class=\"brush:java; wrap-lines:false\">0: jdbc:drill:zk=local&gt;  select * from `\/users\/rmoff\/data\/geo_tweets.fixed.json` limit 5;\r\n+------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+\r\n| created_at | id | id_str | text | source | truncated | user | geo | coordinates | place | is_quote_status | retweet_count | favorite_count | entities | favorited | retweeted | possibly_sensitive | filter_level | lang | timestamp_ms | @version | @timestamp | user_name | user_screen_name | user_followers_count | user_friends_count | user_listed_count | user_favourites_count | user_statuses_count | user_created_at | place_name | place_country | hashtags_list | urls_array | urls_list | user_mentions_screen_name_list | longitude | latitude | hashtags_array | extended_entities | user_mentions_screen_name_array | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | retweeted_status | retweeted_screen_name | quoted_status_id | quoted_status_id_str | quoted_status |\r\n+------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+\r\n| Fri Jul 22 19:37:11 +0000 2016 | 756573827589545984 | 756573827589545984 | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https:\/\/t.co\/joI9GMfRim | dlvr.it | false | {\"id\":67898674,\"id_str\":\"67898674\",\"name\":\"Vancouver Press\",\"screen_name\":\"Vancouver_CP\",\"location\":\"Vancouver, BC\",\"url\":\"http:\/\/vancouver.cityandpress.com\/\",\"description\":\"Latest news from Vancouver. Updates are frequent.\",\"protected\":false,\"verified\":false,\"followers_count\":807,\"friends_count\":13,\"listed_count\":94,\"favourites_count\":1,\"statuses_count\":131010,\"created_at\":\"Sat Aug 22 14:25:37 +0000 2009\",\"utc_offset\":-25200,\"time_zone\":\"Pacific Time (US &amp; Canada)\",\"geo_enabled\":true,\"lang\":\"en\",\"contributors_enabled\":false,\"is_translator\":false,\"profile_background_color\":\"FFFFFF\",\"profile_background_image_url\":\"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png\",\"profile_background_image_url_https\":\"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png\",\"profile_background_tile\":false,\"profile_link_color\":\"8A1C3B\",\"profile_sidebar_border_color\":\"FFFFFF\",\"profile_sidebar_fill_color\":\"FFFFFF\",\"profile_text_color\":\"2A2C31\",\"profile_use_background_image\":false,\"profile_image_url\":\"http:\/\/pbs.twimg.com\/profile_images\/515841109553983490\/_t0QWPco_normal.png\",\"profile_image_url_https\":\"https:\/\/pbs.twimg.com\/profile_images\/515841109553983490\/_t0QWPco_normal.png\",\"profile_banner_url\":\"https:\/\/pbs.twimg.com\/profile_banners\/67898674\/1411821103\",\"default_profile\":false,\"default_profile_image\":false} | {\"type\":\"Point\",\"coordinates\":[49.2814375,-123.12109067]} | {\"type\":\"Point\",\"coordinates\":[-123.12109067,49.2814375]} | {\"id\":\"1e5cb4d0509db554\",\"url\":\"https:\/\/api.twitter.com\/1.1\/geo\/id\/1e5cb4d0509db554.json\",\"place_type\":\"city\",\"name\":\"Vancouver\",\"full_name\":\"Vancouver, British Columbia\",\"country_code\":\"CA\",\"country\":\"Canada\",\"bounding_box\":{\"type\":\"Polygon\",\"coordinates\":[[[-123.224215,49.19854],[-123.224215,49.316738],[-123.022947,49.316738],[-123.022947,49.19854]]]},\"attributes\":{}} | false | 0 | 0 | {\"urls\":[{\"url\":\"https:\/\/t.co\/joI9GMfRim\",\"expanded_url\":\"http:\/\/toplocalnow.com\/ca\/vancouver?section=trends\",\"display_url\":\"toplocalnow.com\/ca\/vancouver?s\u2026\",\"indices\":[70,93]}],\"hashtags\":[],\"user_mentions\":[],\"media\":[],\"symbols\":[]} | false | false | false | low | en | 1469216231616 | 1 | 2016-07-22T19:37:11.000Z | Vancouver Press | Vancouver_CP | 807 | 13 | 94 | 1 | 131010 | Sat Aug 22 14:25:37 +0000 2009 | Vancouver | Canada |  | [\"toplocalnow.com\/ca\/vancouver?s\u2026\"] | toplocalnow.com\/ca\/vancouver?s\u2026 |  | -123.12109067 | 49.2814375 | [] | {\"media\":[]} | [] | null | null | null | null | null | {\"user\":{},\"entities\":{\"user_mentions\":[],\"media\":[],\"hashtags\":[],\"urls\":[]},\"extended_entities\":{\"media\":[]},\"quoted_status\":{\"user\":{},\"entities\":{\"hashtags\":[],\"user_mentions\":[],\"media\":[],\"urls\":[]},\"extended_entities\":{\"media\":[]}}} | null | null | null | {\"user\":{},\"entities\":{\"user_mentions\":[],\"media\":[],\"urls\":[],\"hashtags\":[]},\"extended_entities\":{\"media\":[]},\"place\":{\"bounding_box\":{\"coordinates\":[]},\"attributes\":{}},\"geo\":{\"coordinates\":[]},\"coordinates\":{\"coordinates\":[]}} |<\/pre>\n<p>Within the twitter data there&#8217;s root-level fields, such as <code>text<\/code>, as well as nested ones such as information about the tweeter in the <code>user<\/code> field. As we saw above you reference nested fields using dot notation. Now&#8217;s a good time to point out a couple of common mistakes that you may encounter. The first is not quoting reserved words, and is the first thing to check for if you get an error such as <code>Encountered \".\"<\/code>:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt;  select user.screen_name,text from `\/users\/rmoff\/data\/geo_tweets.fixed.json` limit 5;\r\nError: PARSE ERROR: Encountered \".\" at line 1, column 12.\r\n[...]<\/pre>\n<p>Second is declaring the table alias when using dot notation &#8211; if you don&#8217;t then Apache Drill thinks that the parent column is actually the table name (<code>VALIDATION ERROR: [...] Table 'user' not found<\/code>):<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt;  select `user`.screen_name,text from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` limit 5;\r\nAug 10, 2016 11:16:45 PM org.apache.calcite.sql.validate.SqlValidatorException \r\nSEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'user' not found\r\nAug 10, 2016 11:16:45 PM org.apache.calcite.runtime.CalciteException \r\nSEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 13: Table 'user' not found\r\nError: VALIDATION ERROR: From line 1, column 8 to line 1, column 13: Table 'user' not found\r\n\r\nSQL Query null\r\n\r\n[Error Id: 1427fd23-e180-40be-a751-b6f1f838233a on 192.168.56.1:31010] (state=,code=0)<\/pre>\n<p>With those mistakes fixed, we can see the user&#8217;s screenname:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt;  select tweets.`user`.`screen_name` as user_screen_name,text from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets limit 2;\r\n+------------------+------+\r\n| user_screen_name | text |\r\n+------------------+------+\r\n| Vancouver_CP     | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https:\/\/t.co\/joI9GMfRim |\r\n| tmj_TUC_skltrd   | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |\r\n+------------------+------+\r\n2 rows selected (0.256 seconds)\r\n0: jdbc:drill:zk=local&gt;<\/pre>\n<p>As well as nested objects, JSON supports arrays. An example of this in twitter data is hashtags, or URLs, both of which there can be zero, one, or many of in a given tweet.<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt;  select tweets.entities.hashtags from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets limit 5;\r\n+--------+\r\n| EXPR$0 |\r\n+--------+\r\n| [] |\r\n| [{\"text\":\"hiring\",\"indices\":[6,13]},{\"text\":\"Job\",\"indices\":[98,102]},{\"text\":\"SkilledTrade\",\"indices\":[103,116]},{\"text\":\"Tucson\",\"indices\":[117,124]},{\"text\":\"Jobs\",\"indices\":[129,134]}] |\r\n| [] |\r\n| [] |\r\n| [{\"text\":\"lol\",\"indices\":[72,76]},{\"text\":\"nowthatsfunny\",\"indices\":[77,91]}] |\r\n+--------+\r\n5 rows selected (0.286 seconds)<\/pre>\n<p>Using the <code>FLATTEN<\/code> function each array entry becomes a new row, thus:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt;  select flatten(tweets.entities.hashtags) from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets limit 5;\r\n+----------------------------------------------+\r\n|                    EXPR$0                    |\r\n+----------------------------------------------+\r\n| {\"text\":\"hiring\",\"indices\":[6,13]}           |\r\n| {\"text\":\"Job\",\"indices\":[98,102]}            |\r\n| {\"text\":\"SkilledTrade\",\"indices\":[103,116]}  |\r\n| {\"text\":\"Tucson\",\"indices\":[117,124]}        |\r\n| {\"text\":\"Jobs\",\"indices\":[129,134]}          |\r\n+----------------------------------------------+\r\n5 rows selected (0.139 seconds)<\/pre>\n<p>Note that the <code>limit 5<\/code> clause is showing only the first five array instances, which is actually just hashtags from the first tweet in the above list.<\/p>\n<p>To access the text of the hashtag we use a subquery and the dot notation to access the <code>text<\/code> field:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select ent_hashtags.hashtags.text from (select flatten(tweets.entities.hashtags) as hashtags from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets) as ent_hashtags limit 5;\r\n+---------------+\r\n|    EXPR$0     |\r\n+---------------+\r\n| hiring        |\r\n| Job           |\r\n| SkilledTrade  |\r\n| Tucson        |\r\n| Jobs          |\r\n+---------------+\r\n5 rows selected (0.168 seconds)<\/pre>\n<p>This can be made more readable by using <a href=\"https:\/\/drill.apache.org\/docs\/with-clause\/\" target=\"_blank\">Common Table Expressions<\/a> (CTE, also known as subquery factoring) for the same result:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets)\r\n. . . . . . . . . . . &gt; select ent_hashtags.hashtags.text from ent_hashtags\r\n. . . . . . . . . . . &gt; limit 5;\r\n+---------------+\r\n|    EXPR$0     |\r\n+---------------+\r\n| hiring        |\r\n| Job           |\r\n| SkilledTrade  |\r\n| Tucson        |\r\n| Jobs          |\r\n+---------------+\r\n5 rows selected (0.253 seconds)<\/pre>\n<p>Combining the flattened array with existing fields enables us to see things like a list of tweets with their associated hashtags:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp limit 10;\r\n+------------------+------+---------+\r\n| user_screen_name | text | hashtag |\r\n+------------------+------+---------+\r\n| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | hiring |\r\n| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |\r\n| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | SkilledTrade |\r\n| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Tucson |\r\n| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Jobs |\r\n| johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @\u2026 https:\/\/t.co\/qHWuJXnzbw | lol |\r\n| johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @\u2026 https:\/\/t.co\/qHWuJXnzbw | nowthatsfunny |\r\n| greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to\u2026 https:\/\/t.co\/6OVl7crshw #ws @winston_salem_ https:\/\/t.co\/l5h220otj4 | WinstonSalem |\r\n| greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to\u2026 https:\/\/t.co\/6OVl7crshw #ws @winston_salem_ https:\/\/t.co\/l5h220otj4 | ws |\r\n| trendinaliaSG | 6. Hit The Stage\r\n7. TTTT\r\n8. Demi Lovato\r\n9. Beijing\r\n10. Donald Trump\r\n\r\n2016\/7\/23 03:36 SGT #trndnl https:\/\/t.co\/psP0GzBgZB | trndnl |\r\n+------------------+------+---------+\r\n10 rows selected (0.166 seconds)<\/pre>\n<p>We can also filter based on hashtag:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp where tmp.hashtags.text = 'Job' limit 5;\r\n+------------------+------+---------+\r\n| user_screen_name | text | hashtag |\r\n+------------------+------+---------+\r\n| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |\r\n| tmj_VAL_health | Want to work at Genesis Rehab Services? We're #hiring in #Clinton, MD! Click for details: https:\/\/t.co\/4lt7I4gMZk #Job #Healthcare #Jobs | Job |\r\n| tmj_in_retail | Want to work in #Clinton, IN? View our latest opening: https:\/\/t.co\/UiimnlubYs #Job #Retail #Jobs #Hiring #CareerArc | Job |\r\n| tmj_la_hrta | Want to work at SONIC Drive-In? We're #hiring in #Clinton, LA! Click for details: https:\/\/t.co\/aQ1FrWc7iR #Job #SONIC #Hospitality #Jobs | Job |\r\n| tmj_ia_hrta | We're #hiring! Click to apply: Department Manager - https:\/\/t.co\/SnoKcwwHFk #Job #Hospitality #Clinton, IA #Jobs #CareerArc | Job |\r\n+------------------+------+---------+\r\n5 rows selected (0.207 seconds)<\/pre>\n<p>as well as summarise hashtag counts:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets)\r\n. . . . . . . . . . . &gt; select ent_hashtags.hashtags.text,count(ent_hashtags.hashtags.text) from ent_hashtags\r\n. . . . . . . . . . . &gt; group by ent_hashtags.hashtags.text\r\n. . . . . . . . . . . &gt; order by 2 desc;\r\n+-----------------------------+---------+\r\n|           EXPR$0            | EXPR$1  |\r\n+-----------------------------+---------+\r\n| Trump                       | 365     |\r\n| trndnl                      | 176     |\r\n| job                         | 170     |\r\n| Hiring                      | 127     |\r\n| Clinton                     | 108     |\r\n| Yorkshire                   | 100     |\r\n| CareerArc                   | 100     |\r\n[...]<\/pre>\n<p>To filter out records that may not have array values (such as hashtags, which not every tweet has) and without with the query may fail, use <code>IS NOT NULL<\/code> against an attribute of first index of the array:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select tweets.entities.hashtags from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;\r\n+--------+\r\n| EXPR$0 |\r\n+--------+\r\n| [{\"text\":\"hiring\",\"indices\":[6,13]},{\"text\":\"Job\",\"indices\":[98,102]},{\"text\":\"SkilledTrade\",\"indices\":[103,116]},{\"text\":\"Tucson\",\"indices\":[117,124]},{\"text\":\"Jobs\",\"indices\":[129,134]}] |\r\n| [{\"text\":\"lol\",\"indices\":[72,76]},{\"text\":\"nowthatsfunny\",\"indices\":[77,91]}] |\r\n| [{\"text\":\"WinstonSalem\",\"indices\":[0,13]},{\"text\":\"ws\",\"indices\":[92,95]}] |\r\n| [{\"text\":\"trndnl\",\"indices\":[89,96]}] |\r\n| [{\"text\":\"trndnl\",\"indices\":[92,99]}] |\r\n+--------+\r\n5 rows selected (0.187 seconds)<\/pre>\n<p>If you try and compare the array itself, it doesn&#8217;t work:<\/p>\n<p><code> 0: jdbc:drill:zk=local&gt; select tweets.entities.hashtags from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets where tweets.entities.hashtags is not null limit 5; Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index -1. Error: Missing function implementation: [isnotnull(MAP-REPEATED)]. Full expression: --UNKNOWN EXPRESSION--.. Fragment 0:0 [Error Id: 99ac12aa-f6b4-4692-b815-8f483da682c4 on 192.168.56.1:31010] (state=,code=0) <\/code><\/p>\n<p>The above example demonstrates using array indexing, which is an alternative to <code>FLATTEN<\/code> for accessing individual objects in the array if you know they&#8217;re going to exist:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select tweets.entities.hashtags[0].text as first_hashtag,text from dfs.`\/users\/rmoff\/data\/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;\r\n+---------------+------+\r\n| first_hashtag | text |\r\n+---------------+------+\r\n| hiring | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https:\/\/t.co\/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |\r\n| lol | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @\u2026 https:\/\/t.co\/qHWuJXnzbw |\r\n| WinstonSalem | #WinstonSalem Time and place announced for Donald Trump's visit to\u2026 https:\/\/t.co\/6OVl7crshw #ws @winston_salem_ https:\/\/t.co\/l5h220otj4 |<\/pre>\n<h2>Querying CSV files<\/h2>\n<p>JSON files are relatively easy to interpret because they have a semi-defined schema within them, including column names. CSV (and character delimited files in general), on the other hand, are a bit more of a &#8216;wild west&#8217; when it comes to reliably inferring column names. You can <a href=\"https:\/\/drill.apache.org\/docs\/text-files-csv-tsv-psv\/\" target=\"_blank\">configure Apache Drill<\/a> to ignore the first line of a CSV file (on the assumption that it&#8217;s a header) if you want to, or to take them as column names. If you don&#8217;t do this and query a CSV file that looks like this:<\/p>\n<pre class=\"brush:java; wrap-lines:false\">[oracle@bigdatalite ~]$ head nyc_parking_violations.csv\r\nSummons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect    ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation\r\n1360858775,PHW9801,OH,PAS,07\/01\/2015,20,SUBN,HONDA,P,61490,26160,26190,0,0044,44,44,929822,0044,0000,0653P,,BX,O,651,RIVER AVE,,0,408,D,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,<\/pre>\n<p>You&#8217;ll get two records, each one column wide, as an array:<\/p>\n<pre class=\"brush:java; wrap-lines:false\">0: jdbc:drill:zk=local&gt; select *  from `\/user\/oracle\/incoming\/nyc_parking\/nyc_parking_violations.csv` LIMIT 5;\r\n+---------+\r\n| columns |\r\n+---------+\r\n| [\"Summons Number\",\"Plate ID\",\"Registration State\",\"Plate Type\",\"Issue Date\",\"Violation Code\",\"Vehicle Body Type\",\"Vehicle Make\",\"Issuing Agency\",\"Street Code1\",\"Street Code2\",\"Street Code3\",\"Vehicle Expiration Date\",\"Violation Location\",\"Violation Precinct\",\"Issuer Precinct\",\"Issuer Code\",\"Issuer Command\",\"Issuer Squad\",\"Violation Time\",\"Time First Observed\",\"Violation County\",\"Violation In Front Of Or Opposite\",\"House Number\",\"Street Name\",\"Intersecting Street\",\"Date First Observed\",\"Law Section\",\"Sub Division\",\"Violation Legal Code\",\"Days Parking In Effect    \",\"From Hours In Effect\",\"To Hours In Effect\",\"Vehicle Color\",\"Unregistered Vehicle?\",\"Vehicle Year\",\"Meter Number\",\"Feet From Curb\",\"Violation Post Code\",\"Violation Description\",\"No Standing or Stopping Violation\",\"Hydrant Violation\",\"Double Parking Violation\"] |\r\n| [\"1360858775\",\"PHW9801\",\"OH\",\"PAS\",\"07\/01\/2015\",\"20\",\"SUBN\",\"HONDA\",\"P\",\"61490\",\"26160\",\"26190\",\"0\",\"0044\",\"44\",\"44\",\"929822\",\"0044\",\"0000\",\"0653P\",\"\",\"BX\",\"O\",\"651\",\"RIVER AVE\",\"\",\"0\",\"408\",\"D\",\"\",\"BBBBBBB\",\"ALL\",\"ALL\",\"\",\"0\",\"0\",\"-\",\"0\",\"\",\"\",\"\",\"\",\"\"] |<\/pre>\n<p>To access the actual columns in the CSV file you need to use <code>columns[x]<\/code> syntax to reference them. Watch out that <code>columns<\/code> is case-sensitive, and the numbering is zero-based:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `\/user\/oracle\/incoming\/nyc_parking\/nyc_parking_violations.csv` limit 5;\r\n+----------+--------------------+\r\n| PlateID  | RegistrationState  |\r\n+----------+--------------------+\r\n| AR877A   | NJ                 |\r\n| 73268ME  | NY                 |\r\n| 2050240  | IN                 |\r\n| 2250017  | IN                 |\r\n| AH524C   | NJ                 |\r\n+----------+--------------------+\r\n5 rows selected (0.247 seconds)<\/pre>\n<p>To make it easier to work with the data on a repeated basis you can define a view over the data:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; create view dfs.tmp.NYC_Parking_01 as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `\/user\/oracle\/incoming\/nyc_parking\/nyc_parking_violations.csv`;\r\n+-------+-----------------------------------------------------------------+\r\n|  ok   |                             summary                             |\r\n+-------+-----------------------------------------------------------------+\r\n| true  | View 'NYC_Parking_01' created successfully in 'dfs.tmp' schema  |\r\n+-------+-----------------------------------------------------------------+\r\n1 row selected (0.304 seconds)<\/pre>\n<p>This is using the <code>dfs<\/code> storage plugin and the <code>tmp<\/code> schema within it, which has the following storage configuration &#8211; note that <code>writeable<\/code> is true<\/p>\n<pre class=\"brush:java\">\"tmp\": {\r\n  \"location\": \"\/tmp\",\r\n  \"writable\": true,\r\n  \"defaultInputFormat\": null\r\n}<\/pre>\n<p>(if you use the wrong database [storage plugin] or schema you&#8217;ll get <code>Schema [hdfs] is immutable.<\/code>)<\/p>\n<p>Query the new view<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select * from dfs.tmp.NYC_Parking_01 limit 5;\r\n+-----------+---------------------+\r\n|  PlateID  |  RegistrationState  |\r\n+-----------+---------------------+\r\n| Plate ID  | Registration State  |\r\n| PHW9801   | OH                  |\r\n| K8010F    | TN                  |\r\n| GFG6211   | NY                  |\r\n| GHL1805   | NY                  |\r\n+-----------+---------------------+\r\n5 rows selected (0.191 seconds)<\/pre>\n<p>Through the view, or direct against the CSV path, you can also run aggregates:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) &gt; 1 limit 1;\r\n+----------+---------+\r\n| PlateID  | EXPR$1  |\r\n+----------+---------+\r\n| 2050240  | 4       |\r\n+----------+---------+\r\n1 row selected (15.983 seconds)<\/pre>\n<p>Although this isn&#8217;t rerunnable for the same result &#8211; probably because of the <code>limit<\/code> clause<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) &gt; 1 limit 1;\r\n+----------+---------+\r\n| PlateID  | EXPR$1  |\r\n+----------+---------+\r\n| AR877A   | 3       |\r\n+----------+---------+\r\n1 row selected (12.881 seconds)<\/pre>\n<p>Under the covers the view definition is written to <code>\/tmp<\/code> &#8211; you&#8217;ll want to move this path if you&#8217;re wanting to preserve this data past reboot:<\/p>\n<pre class=\"brush:java\">[oracle@bigdatalite parking]$ cat \/tmp\/NYC_Parking_01.view.drill\r\n{\r\n  \"name\" : \"NYC_Parking_01\",\r\n  \"sql\" : \"SELECT `columns`[1] AS `PlateID`, `columns`[2] AS `RegistrationState`\\nFROM `\/user\/oracle\/incoming\/nyc_parking\/nyc_parking_violations.csv`\",\r\n  \"fields\" : [ {\r\n    \"name\" : \"PlateID\",\r\n    \"type\" : \"ANY\",\r\n    \"isNullable\" : true\r\n  }, {\r\n    \"name\" : \"RegistrationState\",\r\n    \"type\" : \"ANY\",\r\n    \"isNullable\" : true\r\n  } ],\r\n  \"workspaceSchemaPath\" : [ \"hdfs\" ]<\/pre>\n<p>You can also create an actual table using CTAS (Create Table As Select):<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; create table dfs.tmp.parking as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `\/user\/oracle\/incoming\/nyc_parking\/nyc_parking_violations.csv`;\r\n+-----------+----------------------------+\r\n| Fragment  | Number of records written  |\r\n+-----------+----------------------------+\r\n| 1_1       | 4471875                    |\r\n| 1_0       | 4788421                    |\r\n+-----------+----------------------------+\r\n2 rows selected (42.913 seconds)<\/pre>\n<p>This is stored on disk (per the <code>dfs<\/code> config) and by default in Parquet format:<\/p>\n<pre class=\"brush:java\">[oracle@bigdatalite parking]$ ls -l \/tmp\/parking\/\r\ntotal 76508\r\n-rw-r--r--. 1 oracle oinstall 40623288 Aug 10 22:53 1_0_0.parquet\r\n-rw-r--r--. 1 oracle oinstall 37717804 Aug 10 22:53 1_1_0.parquet<\/pre>\n<h2>Drill&#8217;s Web Interface<\/h2>\n<p>Drill comes with a web interface which you can access at http:\/\/<\/p>\n<p>:8047\/ and is useful for<\/p>\n<ul>\n<li>Issuing queries<br \/>\n<a href=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-1.png\"><img decoding=\"async\" class=\"aligncenter wp-image-60443 size-large\" src=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-1-1024x588.png\" alt=\"introduction-apache-drill-1\" width=\"620\" height=\"356\" srcset=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-1-1024x588.png 1024w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-1-300x172.png 300w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-1-768x441.png 768w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-1.png 1654w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><br \/>\n<a href=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-2.png\"><img decoding=\"async\" class=\"aligncenter wp-image-60444 size-large\" src=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-2-1024x592.png\" alt=\"introduction-apache-drill-2\" width=\"620\" height=\"358\" srcset=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-2-1024x592.png 1024w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-2-300x173.png 300w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-2-768x444.png 768w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-2.png 1660w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><\/li>\n<li>Configuring additional storage plugins (e.g. <a href=\"https:\/\/drill.apache.org\/docs\/rdbms-storage-plugin\/\" target=\"_blank\">database<\/a>, <a href=\"https:\/\/drill.apache.org\/docs\/file-system-storage-plugin\/\" target=\"_blank\">hdfs<\/a>, etc)<br \/>\n<a href=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-3.png\"><img decoding=\"async\" class=\"aligncenter wp-image-60445 size-large\" src=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-3-1024x761.png\" alt=\"introduction-apache-drill-3\" width=\"620\" height=\"461\" srcset=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-3-1024x761.png 1024w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-3-300x223.png 300w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-3-768x571.png 768w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-3.png 1666w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><\/li>\n<li>Metrics and debug<br \/>\n<a href=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-4.png\"><img decoding=\"async\" class=\"aligncenter wp-image-60446 size-large\" src=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-4-1024x599.png\" alt=\"introduction-apache-drill-4\" width=\"620\" height=\"363\" srcset=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-4-1024x599.png 1024w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-4-300x175.png 300w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-4-768x449.png 768w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-4.png 1652w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><br \/>\n<a href=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-5.png\"><img decoding=\"async\" class=\"aligncenter wp-image-60447 size-large\" src=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-5-1024x466.png\" alt=\"introduction-apache-drill-5\" width=\"620\" height=\"282\" srcset=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-5-1024x466.png 1024w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-5-300x136.png 300w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-5-768x349.png 768w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-5.png 1666w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><\/li>\n<\/ul>\n<h2>Defining Storage Plugins<\/h2>\n<p>From the Drill web interface you can view existing storage plugins, or define new ones. To create a new one, enter its name (for example, <code>hdfs<\/code>, but could be <code>fred<\/code> for all that it matters &#8211; it&#8217;s just a label) under <b>New Storage Plugin<\/b> on the Storage page, and click on Create. Paste the necessary JSON definition in the Configuration box, and then click Create. If you don&#8217;t want to use the GUI there&#8217;s also a REST API.<\/p>\n<p>Storage plugin configuration is stored either within Zookeeper (when running Drill distributed), or locally in the <code>sys.store.provider.local.path<\/code> path when running standalone. By default this is under <code>\/tmp<\/code> which gets cleared down at server reboot. To persist custom storage configurations amend the <code>sys.store.provider.local.path<\/code> in <code>drill-override.conf<\/code>, for example:<\/p>\n<pre class=\"brush:java\">drill.exec: {\r\n    cluster-id: \"drillbits1\",\r\n    zk.connect: \"localhost:2181\"\r\n    sys.store.provider.local.path=\"\/home\/oracle\/drill\/\"\r\n}<\/pre>\n<h3>Working with filesystem data<\/h3>\n<p>Here&#8217;s an example of a storage configuration that enables Drill to access a CDH cluster&#8217;s HDFS:<\/p>\n<pre class=\"brush:java\">{\r\n  \"type\": \"file\",\r\n  \"enabled\": true,\r\n  \"connection\": \"hdfs:\/\/cdh57-01-node-01:8020\/\",\r\n  \"config\": null,\r\n  \"workspaces\": {\r\n    \"root\": {\r\n      \"location\": \"\/\",\r\n      \"writable\": true,\r\n      \"defaultInputFormat\": null\r\n    }\r\n  },\r\n  \"formats\": {\r\n    \"csv\": {\r\n      \"type\": \"text\",\r\n      \"extensions\": [\r\n        \"csv\"\r\n      ],\r\n      \"delimiter\": \",\"\r\n    },\r\n    \"json\": {\r\n      \"type\": \"json\",\r\n      \"extensions\": [\r\n        \"json\"\r\n      ]\r\n    }\r\n  }\r\n}<\/pre>\n<p>As well as the <code>connection<\/code> parameter itself for HDFS, the important bit in this configuration is the <code>formats<\/code> section. This tells Drill how to interpet files that it finds, without the end-user having to explicitly declare their type.<\/p>\n<p>For the filesystem-based plugin <code>dfs<\/code> (which can include local files, HDFS, even Amazon S3), you can browse the available &#8220;tables&#8221;:<\/p>\n<p>List the files in HDFS (previously selected with <code>use hdfs;<\/code>)<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; show files;\r\n+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+\r\n|  name  | isDirectory  | isFile  | length  | owner  |    group    | permissions  |       accessTime       |     modificationTime     |\r\n+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+\r\n| hbase  | true         | false   | 0       | hbase  | supergroup  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-07-25 14:46:08.212  |\r\n| share  | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwx    | 1969-12-31 19:00:00.0  | 2016-05-15 12:28:08.152  |\r\n| solr   | true         | false   | 0       | solr   | solr        | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-06-01 18:34:50.716  |\r\n| tmp    | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwt    | 1969-12-31 19:00:00.0  | 2016-06-24 04:54:41.491  |\r\n| user   | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwx    | 1969-12-31 19:00:00.0  | 2016-06-21 15:55:59.084  |\r\n| var    | true         | false   | 0       | hdfs   | supergroup  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-11 17:53:29.804  |\r\n+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+\r\n6 rows selected (0.145 seconds)<\/pre>\n<p>Show files in a given path:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; show files in `\/user\/oracle`;\r\n+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+\r\n|      name      | isDirectory  | isFile  | length  |  owner  |  group  | permissions  |       accessTime       |     modificationTime     |\r\n+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+\r\n| .Trash         | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-23 20:42:34.815  |\r\n| .sparkStaging  | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-07-06 03:56:38.863  |\r\n| .staging       | true         | false   | 0       | oracle  | oracle  | rwx------    | 1969-12-31 19:00:00.0  | 2016-06-01 18:37:04.005  |\r\n| incoming       | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-08-03 05:34:12.38   |\r\n| mediademo      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-06-01 18:59:45.653  |\r\n| moviedemo      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:02:55.652  |\r\n| moviework      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.497  |\r\n| oggdemo        | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.552  |\r\n| oozie-oozi     | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.651  |\r\n+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+\r\n9 rows selected (0.428 seconds)<\/pre>\n<p>You can also query across multiple files by specifying a wildcard match. Here&#8217;s the truncated list of files available:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:&gt; show files in `hdfs`.`\/user\/rmoff\/incoming\/twitter\/2016\/06\/17\/tweets\/`;\r\n+--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+\r\n|           name           | isDirectory  | isFile  |  length  | owner  | group  | permissions  |        accessTime        |     modificationTime     |\r\n+--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+\r\n| FlumeData.1466176113171  | false        | true    | 1055675  | rmoff  | rmoff  | rw-r--r--    | 2016-08-10 21:28:27.072  | 2016-06-17 16:08:38.023  |\r\n| FlumeData.1466176113172  | false        | true    | 1051411  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.756  | 2016-06-17 16:08:40.597  |\r\n| FlumeData.1466176113173  | false        | true    | 1054734  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.752  | 2016-06-17 16:08:43.33   |\r\n| FlumeData.1466176113174  | false        | true    | 1050991  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.743  | 2016-06-17 16:08:44.361  |\r\n| FlumeData.1466176113175  | false        | true    | 1053577  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.748  | 2016-06-17 16:08:45.162  |\r\n| FlumeData.1466176113176  | false        | true    | 1051965  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.752  | 2016-06-17 16:08:46.261  |\r\n| FlumeData.1466176113177  | false        | true    | 1049555  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:47.425  |\r\n| FlumeData.1466176113178  | false        | true    | 1050566  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:48.23   |\r\n| FlumeData.1466176113179  | false        | true    | 1051751  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.756  | 2016-06-17 16:08:49.381  |\r\n| FlumeData.1466176113180  | false        | true    | 1052249  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.757  | 2016-06-17 16:08:50.042  |\r\n| FlumeData.1466176113181  | false        | true    | 1055002  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:50.896  |\r\n| FlumeData.1466176113182  | false        | true    | 1050812  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:52.191  |\r\n| FlumeData.1466176113183  | false        | true    | 1048954  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.757  | 2016-06-17 16:08:52.994  |\r\n| FlumeData.1466176113184  | false        | true    | 1051559  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.773  | 2016-06-17 16:08:54.025  |\r\n[...]<\/pre>\n<p>Count number of records in one file (<code>FlumeData.1466176113171<\/code>):<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:&gt; SELECT count(*) FROM table(`hdfs`.`\/user\/rmoff\/incoming\/twitter\/2016\/06\/17\/tweets\/FlumeData.1466176113171`(type =&gt; 'json'));\r\n+---------+\r\n| EXPR$0  |\r\n+---------+\r\n| 277     |\r\n+---------+\r\n1 row selected (0.798 seconds)<\/pre>\n<p>In several files (<code>FlumeData.146617611317*<\/code>):<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:&gt; SELECT count(*) FROM table(`hdfs`.`\/user\/rmoff\/incoming\/twitter\/2016\/06\/17\/tweets\/FlumeData.146617611317*`(type =&gt; 'json'));\r\n+---------+\r\n| EXPR$0  |\r\n+---------+\r\n| 2415    |\r\n+---------+\r\n1 row selected (2.466 seconds)<\/pre>\n<p>In all files in the folder (<code>*<\/code>):<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:&gt; SELECT count(*) FROM table(`hdfs`.`\/user\/rmoff\/incoming\/twitter\/2016\/06\/17\/tweets\/*`(type =&gt; 'json'));\r\n+---------+\r\n| EXPR$0  |\r\n+---------+\r\n| 7414    |\r\n+---------+\r\n1 row selected (3.867 seconds)<\/pre>\n<p>And even across multiple folders:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:&gt; SELECT count(*) FROM table(`hdfs`.`\/user\/flume\/incoming\/twitter\/2016\/06\/*\/*`(type =&gt; 'json'));\r\n+---------+\r\n| EXPR$0  |\r\n+---------+\r\n| 206793  |\r\n+---------+\r\n1 row selected (87.545 seconds)<\/pre>\n<h3>Querying data without an identifying extension<\/h3>\n<p>Drill relies on the <code>format<\/code> clause of the storage extension configurations in orer to determine how to interpret files based on their extensions. You won&#8217;t always have that luxury of extensions being available, or being defined. If you try and query such data, you&#8217;ll not get far. In this example I&#8217;m querying data on HDFS that&#8217;s in JSON format but without the <code>.json<\/code> suffix:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:&gt; SELECT text FROM `hdfs`.`\/user\/rmoff\/incoming\/twitter\/2016\/06\/17\/tweets\/FlumeData.1466176113171` limit 5;\r\nError: VALIDATION ERROR: From line 1, column 18 to line 1, column 23: Table 'hdfs.\/user\/rmoff\/incoming\/twitter\/2016\/06\/17\/tweets\/FlumeData.1466176113171' not found\r\n\r\nSQL Query null<\/pre>\n<p>Fear not &#8211; you can declare them as part of the query syntax.<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:&gt; SELECT text FROM table(`hdfs`.`\/user\/rmoff\/incoming\/twitter\/2016\/06\/17\/tweets\/FlumeData.1466176113171`(type =&gt; 'json')) limit 5;\r\n+------+\r\n| text |\r\n+------+\r\n| RT @jjkukrl: susu bayi jg lagi mahal nih ugh ayah harus semangat cari duit ^^9 https:\/\/t.co\/2NvTOShRbI |\r\n| Oracle Java 1Z0-808 Web Exam Simulator https:\/\/t.co\/tZ3gU8EMj3 |\r\n| @TribuneSelatan ahaha kudu gaya atuh da arek lebarann ahahaha |\r\n| Short impression of yesterday's speech. What a great day it was! #lifeatoracle #team #salesincentive #oracle https:\/\/t.co\/SVK2ovOe3U |\r\n| Want to work at Oracle? We're #hiring in New York! Click for details: https:\/\/t.co\/NMTo1WMHVw #Sales #Job #Jobs #CareerArc |\r\n+------+\r\n5 rows selected (1.267 seconds)<\/pre>\n<h3>Storage Configuration &#8211; Oracle<\/h3>\n<p>Per <a href=\"https:\/\/drill.apache.org\/docs\/rdbms-storage-plugin\/\" target=\"_blank\">the documentation<\/a> it&#8217;s easy to query data residing in a RDBMS, such as Oracle. Simply copy the JDBC driver into Apache Drill&#8217;s jar folder:<\/p>\n<pre class=\"brush:java\">cp \/u01\/app\/oracle\/product\/12.1.0.2\/dbhome_1\/jdbc\/lib\/ojdbc7.jar \/opt\/apache-drill-1.7.0\/jars\/3rdparty\/<\/pre>\n<p>And then add the necessary storage configuration, which I called ora:<\/p>\n<pre class=\"brush:java\">{\r\n  \"type\": \"jdbc\",\r\n  \"driver\": \"oracle.jdbc.OracleDriver\",\r\n  \"url\": \"jdbc:oracle:thin:moviedemo\/welcome1@localhost:1521\/ORCL\",\r\n  \"username\": null,\r\n  \"password\": null,\r\n  \"enabled\": true\r\n}<\/pre>\n<p>If you get an error <code>Please retry: error (unable to create\/ update storage)<\/code> then check that the target Oracle database is up, the password is correct, and so on.<\/p>\n<p>You can then query the data within Hive:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; use ora.MOVIEDEMO;\r\n+-------+--------------------------------------------+\r\n|  ok   |                  summary                   |\r\n+-------+--------------------------------------------+\r\n| true  | Default schema changed to [ora.MOVIEDEMO]  |\r\n+-------+--------------------------------------------+\r\n1 row selected (0.205 seconds)\r\n\r\n0: jdbc:drill:zk=local&gt; show tables;\r\n+----------------+-----------------------------+\r\n|  TABLE_SCHEMA  |         TABLE_NAME          |\r\n+----------------+-----------------------------+\r\n| ora.MOVIEDEMO  | ACTIVITY                    |\r\n| ora.MOVIEDEMO  | BDS_CUSTOMER_RFM            |\r\n| ora.MOVIEDEMO  | BUSINESS_REVIEW_SUMMARY     |\r\n[...]\r\n\r\n0: jdbc:drill:zk=local&gt; select * from ACTIVITY limit 5;\r\n+--------------+---------+\r\n| ACTIVITY_ID  |  NAME   |\r\n+--------------+---------+\r\n| 3.0          | Pause   |\r\n| 6.0          | List    |\r\n| 7.0          | Search  |\r\n| 8.0          | Login   |\r\n| 9.0          | Logout  |\r\n+--------------+---------+\r\n5 rows selected (1.644 seconds)<\/pre>\n<p>If you get Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query. then enable verbose errors in Apache Drill to see what the problem is:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; ALTER SESSION SET `exec.errors.verbose` = true;\r\n+-------+-------------------------------+\r\n|  ok   |            summary            |\r\n+-------+-------------------------------+\r\n| true  | exec.errors.verbose updated.  |\r\n+-------+-------------------------------+\r\n1 row selected (0.154 seconds)\r\n\r\n0: jdbc:drill:zk=local&gt; select * from ora.MOVIEDEMO.YELP_BUSINESS limit 1;\r\nError: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query.\r\n\r\nsql SELECT *\r\nFROM \"MOVIEDEMO\".\"YELP_BUSINESS\"\r\nplugin ora\r\nFragment 0:0\r\n\r\n[Error Id: 40343dd5-1354-48ed-90ef-77ae1390411b on bigdatalite.localdomain:31010]\r\n\r\n(java.sql.SQLException) ORA-29913: error in executing ODCIEXTTABLEOPEN callout\r\nORA-29400: data cartridge error\r\nKUP-11504: error from external driver: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused<\/pre>\n<p>Here the problem was with the external table that Oracle was querying (<code>ORA-29913: error in executing ODCIEXTTABLEOPEN<\/code>). It&#8217;s actually an Oracle external table over a Hive table, which obviously Drill could be querying directly &#8211; but hey, we&#8217;re just sandboxing here&#8230;<\/p>\n<h2>Query Execution<\/h2>\n<p>Just as Oracle has its Cost Based Optimiser (CBO) which helps it determine how to execute a query, and do so most efficiently, Apache Drill has an execution engine that determines how to actually execute the query you give it. This also includes how to split it up over multiple nodes (&#8220;drillbits&#8221;) if available, as well as optimisations such as <a href=\"https:\/\/drill.apache.org\/docs\/how-to-partition-data\/\" target=\"_blank\">partition pruning<\/a> in certain cases. You can read more about <a href=\"https:\/\/drill.apache.org\/docs\/drill-query-execution\/\" target=\"_blank\">how the query execution works here<\/a>, and watch a <a href=\"https:\/\/www.youtube.com\/watch?v=0rurIzOkTIg\" target=\"_blank\">good explanation of it here<\/a>. To view the explain plan for a query use <code>explain plan<\/code>:<\/p>\n<pre class=\"brush:java; wrap-lines:false\">0: jdbc:drill:zk=local&gt; !set maxwidth 10000  \r\n0: jdbc:drill:zk=local&gt; explain plan for select `date`,count(*) as tip_count from `\/user\/oracle\/incoming\/yelp\/tip_json\/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;  \r\n+------+------+\r\n| text | json |\r\n+------+------+\r\n| 00-00    Screen\r\n00-01      Project(date=[$0], tip_count=[$1])  \r\n00-02        SelectionVectorRemover  \r\n00-03          Limit(fetch=[5])  \r\n00-04            SelectionVectorRemover  \r\n00-05              TopN(limit=[5])  \r\n00-06                HashAgg(group=[{0}], tip_count=[$SUM0($1)])  \r\n00-07                  HashAgg(group=[{0}], tip_count=[COUNT()])  \r\n00-08                    Scan(groupscan=[EasyGroupScan [selectionRoot=hdfs:\/\/localhost:8020\/user\/oracle\/incoming\/yelp\/tip_json\/yelp_academic_dataset_tip.json, numFiles=1, columns=[`date`], files=[hdfs:\/\/localhost:8020\/user\/oracle\/incoming\/yelp\/tip_json\/yelp_academic_dataset_tip.json]]])  \r\n | {\r\n  \"head\" : {\r\n    \"version\" : 1,\r\n    \"generator\" : {\r\n      \"type\" : \"ExplainHandler\",\r\n      \"info\" : \"\"\r\n    [...]<\/pre>\n<p>You can also use the Drill web interface to see information about how a query executed:<\/p>\n<p><a href=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-6.png\"><img decoding=\"async\" class=\"aligncenter wp-image-60448 size-large\" src=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-6-1024x856.png\" alt=\"introduction-apache-drill-6\" width=\"620\" height=\"518\" srcset=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-6-1024x856.png 1024w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-6-300x251.png 300w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-6-768x642.png 768w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-6.png 1654w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><\/p>\n<p><a href=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-7.png\"><img decoding=\"async\" class=\"aligncenter wp-image-60449 size-large\" src=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-7-1024x943.png\" alt=\"introduction-apache-drill-7\" width=\"620\" height=\"571\" srcset=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-7-1024x943.png 1024w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-7-300x276.png 300w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-7-768x707.png 768w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-7.png 1812w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><\/p>\n<h3>Drill Explorer<\/h3>\n<p>The MapR Drill ODBC driver comes with a tool called <b><a href=\"https:\/\/drill.apache.org\/docs\/drill-explorer-introduction\/\" target=\"_blank\">Drill Explorer<\/a><\/b>. This is a GUI that enables you to explore the data by navigating the databases (==storage plugins) and folders\/files within, previewing the data and even creating views on it.<\/p>\n<p><a href=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-8.png\"><img decoding=\"async\" class=\"aligncenter wp-image-60450 size-large\" src=\"http:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-8-1024x602.png\" alt=\"introduction-apache-drill-8\" width=\"620\" height=\"364\" srcset=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-8-1024x602.png 1024w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-8-300x176.png 300w, https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2016\/09\/introduction-apache-drill-8-768x452.png 768w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/a><\/p>\n<h3>Drill Client<\/h3>\n<p>Within the Drill client there are various settings available:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; !set  \r\nautocommit          true  \r\nautosave            false  \r\ncolor               true  \r\nfastconnect         true  \r\nforce               false  \r\nheaderinterval      100  \r\nhistoryfile         \/home\/oracle\/.sqlline\/history  \r\nincremental         true  \r\nisolation           TRANSACTION_REPEATABLE_READ  \r\nmaxcolumnwidth      15  \r\nmaxheight           56  \r\nmaxwidth            1000000  \r\nnumberformat        default  \r\noutputformat        table  \r\npropertiesfile      \/home\/oracle\/.sqlline\/sqlline.properties  \r\nrowlimit            0  \r\nshowelapsedtime     true  \r\nshowheader          true  \r\nshownestederrs      false  \r\nshowwarnings        true  \r\nsilent              false  \r\ntimeout             -1  \r\ntrimscripts         true  \r\nverbose             false<\/pre>\n<p>To change one, such as the width of output displayed:<\/p>\n<pre class=\"brush:java\">0: jdbc:drill:zk=local&gt; !set maxwidth 10000<\/pre>\n<p>To connect to remote Drill specify the Zookeeper node(s) that store the Drillbit connection information:<\/p>\n<pre class=\"brush:java\">rmoff@asgard-3:apache-drill-1.7.0&gt; bin\/sqlline -u jdbc:drill:zk=cdh57-01-node-01.moffatt.me:2181,cdh57-01-node-02.moffatt.me:2181,cdh57-01-node-03.moffatt.me:2181<\/pre>\n<h2>Conclusion<\/h2>\n<p>Apache Drill is a powerful tool for using familiar querying language (SQL) against different data sources. On a small scale, simply being able to slice and dice through structured files like JSON is a massive win. On a larger scale, it will be interesting to experiment with how Apache Drill compares when querying larger volumes of data across a cluster of machines, maybe compared to a tool such as Impala.<\/p>\n<p>For more information about Apache Drill see <a href=\"\/blog\/essential-tips-using-apache-drill-obiee-12c\" target=\"_blank\">how to access Drill from within OBIEE.<\/a><\/p>\n<div class=\"attribution\">\n<table>\n<tbody>\n<tr>\n<td><span class=\"reference\">Reference: <\/span><\/td>\n<td><a href=\"https:\/\/www.mapr.com\/blog\/how-guide-getting-started-apache-drill\">How to Guide: Getting Started with Apache Drill<\/a> from our <a href=\"http:\/\/www.javacodegeeks.com\/join-us\/jcg\/\">JCG partner<\/a> Robin Moffatt at the <a href=\"http:\/\/www.mapr.com\/blog\">Mapr<\/a> blog.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Apache Drill is an engine that can connect to many different data sources, and provide a SQL interface to them. It&#8217;s not just a wanna-be SQL interface that trips over at anything complex &#8211; it&#8217;s a hugely functional one including support for many built in functions as well as windowing functions. Whilst it can connect &hellip;<\/p>\n","protected":false},"author":1094,"featured_media":112,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[1088],"class_list":["post-60388","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-enterprise-java","tag-apache-drill"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>How to Guide: Getting Started with Apache Drill - Java Code Geeks<\/title>\n<meta name=\"description\" content=\"Apache Drill is an engine that can connect to many different data sources, and provide a SQL interface to them. It&#039;s not just a wanna-be SQL interface\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Guide: Getting Started with Apache Drill - Java Code Geeks\" \/>\n<meta property=\"og:description\" content=\"Apache Drill is an engine that can connect to many different data sources, and provide a SQL interface to them. It&#039;s not just a wanna-be SQL interface\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html\" \/>\n<meta property=\"og:site_name\" content=\"Java Code Geeks\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/javacodegeeks\" \/>\n<meta property=\"article:published_time\" content=\"2016-09-27T16:00:25+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/enterprise-java-logo.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"150\" \/>\n\t<meta property=\"og:image:height\" content=\"150\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Robin Moffatt\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:site\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Robin Moffatt\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"38 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html\"},\"author\":{\"name\":\"Robin Moffatt\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/person\\\/fea91f47997f13d107701e2e0e087193\"},\"headline\":\"How to Guide: Getting Started with Apache Drill\",\"datePublished\":\"2016-09-27T16:00:25+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html\"},\"wordCount\":2443,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/enterprise-java-logo.jpg\",\"keywords\":[\"Apache Drill\"],\"articleSection\":[\"Enterprise Java\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html\",\"name\":\"How to Guide: Getting Started with Apache Drill - Java Code Geeks\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/enterprise-java-logo.jpg\",\"datePublished\":\"2016-09-27T16:00:25+00:00\",\"description\":\"Apache Drill is an engine that can connect to many different data sources, and provide a SQL interface to them. It's not just a wanna-be SQL interface\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html#primaryimage\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/enterprise-java-logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2012\\\/10\\\/enterprise-java-logo.jpg\",\"width\":150,\"height\":150,\"caption\":\"java-interview-questions-answers\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/2016\\\/09\\\/guide-getting-started-apache-drill.html#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Java\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/category\\\/java\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Enterprise Java\",\"item\":\"https:\\\/\\\/www.javacodegeeks.com\\\/category\\\/java\\\/enterprise-java\"},{\"@type\":\"ListItem\",\"position\":4,\"name\":\"How to Guide: Getting Started with Apache Drill\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#website\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\",\"name\":\"Java Code Geeks\",\"description\":\"Java Developers Resource Center\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\"},\"alternateName\":\"JCG\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.javacodegeeks.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#organization\",\"name\":\"Exelixis Media P.C.\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/exelixis-logo.png\",\"contentUrl\":\"https:\\\/\\\/www.javacodegeeks.com\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/exelixis-logo.png\",\"width\":864,\"height\":246,\"caption\":\"Exelixis Media P.C.\"},\"image\":{\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/javacodegeeks\",\"https:\\\/\\\/x.com\\\/javacodegeeks\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.javacodegeeks.com\\\/#\\\/schema\\\/person\\\/fea91f47997f13d107701e2e0e087193\",\"name\":\"Robin Moffatt\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/43c9ad042966af59586d0a526cc96399fd8e42c77f9e42079f3d75e1aa140638?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/43c9ad042966af59586d0a526cc96399fd8e42c77f9e42079f3d75e1aa140638?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/43c9ad042966af59586d0a526cc96399fd8e42c77f9e42079f3d75e1aa140638?s=96&d=mm&r=g\",\"caption\":\"Robin Moffatt\"},\"description\":\"Robin Moffatt is Head of R&amp;D (Europe) at Rittman Mead, and an Oracle ACE. His particular interests are analytics, systems architecture, administration, and performance optimization.\",\"url\":\"https:\\\/\\\/www.javacodegeeks.com\\\/author\\\/robin-moffatt\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How to Guide: Getting Started with Apache Drill - Java Code Geeks","description":"Apache Drill is an engine that can connect to many different data sources, and provide a SQL interface to them. It's not just a wanna-be SQL interface","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html","og_locale":"en_US","og_type":"article","og_title":"How to Guide: Getting Started with Apache Drill - Java Code Geeks","og_description":"Apache Drill is an engine that can connect to many different data sources, and provide a SQL interface to them. It's not just a wanna-be SQL interface","og_url":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html","og_site_name":"Java Code Geeks","article_publisher":"https:\/\/www.facebook.com\/javacodegeeks","article_published_time":"2016-09-27T16:00:25+00:00","og_image":[{"width":150,"height":150,"url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/enterprise-java-logo.jpg","type":"image\/jpeg"}],"author":"Robin Moffatt","twitter_card":"summary_large_image","twitter_creator":"@javacodegeeks","twitter_site":"@javacodegeeks","twitter_misc":{"Written by":"Robin Moffatt","Est. reading time":"38 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html#article","isPartOf":{"@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html"},"author":{"name":"Robin Moffatt","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/fea91f47997f13d107701e2e0e087193"},"headline":"How to Guide: Getting Started with Apache Drill","datePublished":"2016-09-27T16:00:25+00:00","mainEntityOfPage":{"@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html"},"wordCount":2443,"commentCount":0,"publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"image":{"@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html#primaryimage"},"thumbnailUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/enterprise-java-logo.jpg","keywords":["Apache Drill"],"articleSection":["Enterprise Java"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html","url":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html","name":"How to Guide: Getting Started with Apache Drill - Java Code Geeks","isPartOf":{"@id":"https:\/\/www.javacodegeeks.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html#primaryimage"},"image":{"@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html#primaryimage"},"thumbnailUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/enterprise-java-logo.jpg","datePublished":"2016-09-27T16:00:25+00:00","description":"Apache Drill is an engine that can connect to many different data sources, and provide a SQL interface to them. It's not just a wanna-be SQL interface","breadcrumb":{"@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html#primaryimage","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/enterprise-java-logo.jpg","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2012\/10\/enterprise-java-logo.jpg","width":150,"height":150,"caption":"java-interview-questions-answers"},{"@type":"BreadcrumbList","@id":"https:\/\/www.javacodegeeks.com\/2016\/09\/guide-getting-started-apache-drill.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.javacodegeeks.com\/"},{"@type":"ListItem","position":2,"name":"Java","item":"https:\/\/www.javacodegeeks.com\/category\/java"},{"@type":"ListItem","position":3,"name":"Enterprise Java","item":"https:\/\/www.javacodegeeks.com\/category\/java\/enterprise-java"},{"@type":"ListItem","position":4,"name":"How to Guide: Getting Started with Apache Drill"}]},{"@type":"WebSite","@id":"https:\/\/www.javacodegeeks.com\/#website","url":"https:\/\/www.javacodegeeks.com\/","name":"Java Code Geeks","description":"Java Developers Resource Center","publisher":{"@id":"https:\/\/www.javacodegeeks.com\/#organization"},"alternateName":"JCG","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.javacodegeeks.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.javacodegeeks.com\/#organization","name":"Exelixis Media P.C.","url":"https:\/\/www.javacodegeeks.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","contentUrl":"https:\/\/www.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","width":864,"height":246,"caption":"Exelixis Media P.C."},"image":{"@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/javacodegeeks","https:\/\/x.com\/javacodegeeks"]},{"@type":"Person","@id":"https:\/\/www.javacodegeeks.com\/#\/schema\/person\/fea91f47997f13d107701e2e0e087193","name":"Robin Moffatt","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/43c9ad042966af59586d0a526cc96399fd8e42c77f9e42079f3d75e1aa140638?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/43c9ad042966af59586d0a526cc96399fd8e42c77f9e42079f3d75e1aa140638?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/43c9ad042966af59586d0a526cc96399fd8e42c77f9e42079f3d75e1aa140638?s=96&d=mm&r=g","caption":"Robin Moffatt"},"description":"Robin Moffatt is Head of R&amp;D (Europe) at Rittman Mead, and an Oracle ACE. His particular interests are analytics, systems architecture, administration, and performance optimization.","url":"https:\/\/www.javacodegeeks.com\/author\/robin-moffatt"}]}},"_links":{"self":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/60388","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/users\/1094"}],"replies":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/comments?post=60388"}],"version-history":[{"count":0,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/60388\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media\/112"}],"wp:attachment":[{"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/media?parent=60388"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/categories?post=60388"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.javacodegeeks.com\/wp-json\/wp\/v2\/tags?post=60388"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}