Building A Data Science Toolbox


Jeroen Janssens
@jeroenhjanssens
Data Science at the Command Line Data Science Toolbox Building your own Data Science Toolbox

Overview

- Data science at the command line


- Data Science Toolbox
- Building your own data science toolbox

Building a Data Science Toolbox Jeroen Janssens



Data Science at the Command Line


Data science is OSEMN

- Obtaining data
- Scrubbing data
- Exploring data
- Modeling data
- iNterpreting data


Command line on Mac OS X


Command line on Ubuntu


The command line is awesome

- Play with your data (REPL)


- Combine tools
- Many tools available
- Automatable
- Many servers run GNU/Linux
- One overarching environment


Essential Tools and Concepts


Command-line tool is an umbrella term

- Executable
- Script
- One-liner
- Shell command
- Shell function
- Alias


Unix philosophy

Write command-line tools that:


- Do one thing and do it well
- Work together
- Handle text streams
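
As a rough illustration of these three points, the sketch below writes two small filters as Python generator functions over lines of text and chains them the way a shell pipeline would. The function names `grep` and `upper` are made up for this example and are not taken from the talk; the two sample lines come from the tips dataset shown on the slides.

```python
import io
import re
from typing import Iterable, Iterator


def grep(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Do one thing: keep only the lines that match a regular expression."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))


def upper(lines: Iterable[str]) -> Iterator[str]:
    """Do one other thing: upper-case every line."""
    return (line.upper() for line in lines)


# A text stream; in a real tool this would be sys.stdin.
stream = io.StringIO("16.99,1.01,Female,No,Sun,Dinner,2\n"
                     "27.2,4.0,Male,No,Thur,Lunch,4\n")

# Work together: the output of one filter is the input of the next,
# similar in spirit to `grep Lunch | tr '[:lower:]' '[:upper:]'`.
result = list(upper(grep("Lunch", stream)))
print("".join(result), end="")
```

Because both filters only consume and produce lines of text, either one can be swapped out or reused elsewhere without touching the other.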


Tips dataset
$ cat tips.csv
bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
25.29,4.71,Male,No,Sun,Dinner,4
8.77,2.0,Male,No,Sun,Dinner,2
26.88,3.12,Male,No,Sun,Dinner,4
15.04,1.96,Male,No,Sun,Dinner,2
14.78,3.23,Male,No,Sun,Dinner,2
10.27,1.71,Male,No,Sun,Dinner,2
35.26,5.0,Female,No,Sun,Dinner,4

Reference manual
$ man cat
CAT(1) User Commands CAT(1)

NAME
cat - concatenate files and print on the standard output
SYNOPSIS
cat [OPTION]... [FILE]...
DESCRIPTION
Concatenate FILE(s), or standard input, to standard output.

-A, --show-all
equivalent to -vET

Looking at files
$ cat tips.csv | csvlook
|--------+------+--------+--------+------+--------+-------|
| bill | tip | sex | smoker | day | time | size |
|--------+------+--------+--------+------+--------+-------|
| 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
| 8.77 | 2.0 | Male | No | Sun | Dinner | 2 |
| 26.88 | 3.12 | Male | No | Sun | Dinner | 4 |
| 15.04 | 1.96 | Male | No | Sun | Dinner | 2 |
| 14.78 | 3.23 | Male | No | Sun | Dinner | 2 |

Looking at files
$ cat tips.csv | less
$ cat tips.csv | head -n 3 | csvlook
|--------+------+--------+--------+-----+--------+-------|
| bill | tip | sex | smoker | day | time | size |
|--------+------+--------+--------+-----+--------+-------|
| 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
|--------+------+--------+--------+-----+--------+-------|
$ < tips.csv tail -n 3 | csvlook -H
|--------+------+--------+-----+------+--------+----|
| 22.67 | 2.0 | Male | Yes | Sat | Dinner | 2 |
| 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
| 18.78 | 3.0 | Female | No | Thur | Dinner | 2 |
|--------+------+--------+-----+------+--------+----|

Filtering lines
$ grep 'Lunch' tips.csv | csvlook -H
|--------+------+--------+-----+------+-------+----|
| 27.2 | 4.0 | Male | No | Thur | Lunch | 4 |
| 22.76 | 3.0 | Male | No | Thur | Lunch | 2 |
| 17.29 | 2.71 | Male | No | Thur | Lunch | 2 |
| 19.44 | 3.0 | Male | Yes | Thur | Lunch | 2 |
| 16.66 | 3.4 | Male | No | Thur | Lunch | 2 |
| 10.07 | 1.83 | Female | No | Thur | Lunch | 1 |
| 32.68 | 5.0 | Male | Yes | Thur | Lunch | 2 |
| 15.98 | 2.03 | Male | No | Thur | Lunch | 2 |
| 34.83 | 5.17 | Female | No | Thur | Lunch | 4 |
| 13.03 | 2.0 | Male | No | Thur | Lunch | 2 |
| 18.28 | 4.0 | Male | No | Thur | Lunch | 2 |
| 24.71 | 5.85 | Male | No | Thur | Lunch | 2 |

Filtering lines
$ cat tips.csv | awk -F, '$7 !~ /[1-4]/' | csvlook
|--------+------+--------+--------+------+--------+-------|
| bill | tip | sex | smoker | day | time | size |
|--------+------+--------+--------+------+--------+-------|
| 29.8 | 4.2 | Female | No | Thur | Lunch | 6 |
| 34.3 | 6.7 | Male | No | Thur | Lunch | 6 |
| 41.19 | 5.0 | Male | No | Thur | Lunch | 5 |
| 27.05 | 5.0 | Female | No | Thur | Lunch | 6 |
| 29.85 | 5.14 | Female | No | Sun | Dinner | 5 |
| 48.17 | 5.0 | Male | No | Sun | Dinner | 6 |
| 20.69 | 5.0 | Male | No | Sun | Dinner | 5 |
| 30.46 | 2.0 | Male | Yes | Sun | Dinner | 5 |
| 28.15 | 3.0 | Male | Yes | Sat | Dinner | 5 |
|--------+------+--------+--------+------+--------+-------|

Filtering lines
$ csvgrep -c size -r "[1-4]" -i tips.csv | csvlook
|--------+------+--------+--------+------+--------+-------|
| bill | tip | sex | smoker | day | time | size |
|--------+------+--------+--------+------+--------+-------|
| 29.8 | 4.2 | Female | No | Thur | Lunch | 6 |
| 34.3 | 6.7 | Male | No | Thur | Lunch | 6 |
| 41.19 | 5.0 | Male | No | Thur | Lunch | 5 |
| 27.05 | 5.0 | Female | No | Thur | Lunch | 6 |
| 29.85 | 5.14 | Female | No | Sun | Dinner | 5 |
| 48.17 | 5.0 | Male | No | Sun | Dinner | 6 |
| 20.69 | 5.0 | Male | No | Sun | Dinner | 5 |
| 30.46 | 2.0 | Male | Yes | Sun | Dinner | 5 |
| 28.15 | 3.0 | Male | Yes | Sat | Dinner | 5 |
|--------+------+--------+--------+------+--------+-------|
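
For comparison, here is a hedged sketch of the same column-aware filter using Python's standard `csv` module. The helper name `filter_rows` and the constant `SAMPLE` are hypothetical; the sample rows are copied from the tables above. It mirrors the idea behind `csvgrep -c size -r "[1-4]" -i` and is not a reimplementation of that tool.

```python
import csv
import io
import re

SAMPLE = """bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
29.8,4.2,Female,No,Thur,Lunch,6
41.19,5.0,Male,No,Thur,Lunch,5
"""


def filter_rows(text, column, pattern, invert=False):
    """Return rows whose `column` matches `pattern` (or does not, if invert)."""
    regex = re.compile(pattern)
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader
            if bool(regex.search(row[column])) != invert]


# Keep only the rows whose size does not contain a digit from 1 to 4.
large_parties = filter_rows(SAMPLE, "size", "[1-4]", invert=True)
for row in large_parties:
    print(row["bill"], row["size"])
```

Unlike the `awk` version, the header row never reaches the filter, because `csv.DictReader` consumes it to name the columns.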

Extracting columns
$ csvgrep -c size -r "[1-4]" -i tips.csv > size56.csv
$ cut -d, -f1,2 size56.csv
bill,tip
29.8,4.2
34.3,6.7
41.19,5.0
27.05,5.0
29.85,5.14
48.17,5.0
20.69,5.0
30.46,2.0
28.15,3.0


Extracting columns
$ awk -F, '{print $1","$2}' size56.csv
bill,tip
29.8,4.2
34.3,6.7
41.19,5.0
27.05,5.0
29.85,5.14
48.17,5.0
20.69,5.0
30.46,2.0
28.15,3.0


Extracting columns
$ csvcut size56.csv -c bill,tip
bill,tip
29.8,4.2
34.3,6.7
41.19,5.0
27.05,5.0
29.85,5.14
48.17,5.0
20.69,5.0
30.46,2.0
28.15,3.0
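
One practical difference between these three approaches: `cut` and `awk -F,` split on every comma, including commas inside quoted fields, whereas a CSV-aware tool parses the quoting first. The sketch below shows the CSV-aware variant with Python's `csv` module. The helper `cut_columns` and the `note` column are invented for this illustration; they are not part of the tips dataset.

```python
import csv
import io


def cut_columns(text, names):
    """Return CSV text containing only the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=names, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# The last row is invented to show a quoted field that contains a comma.
sample = 'bill,tip,note\n29.8,4.2,ok\n34.3,6.7,"late, busy"\n'
print(cut_columns(sample, ["bill", "note"]), end="")
```

The quoted field survives intact, which a plain split on `,` would have broken into two columns.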


Extracting words
$ curl -s 'http://www.gutenberg.org/cache/epub/76/pg76.txt'|
> tee finn | grep -oE '\w+' | tee words
The
Project
Gutenberg
EBook
of
Adventures
of
Huckleberry
Finn
Complete
by
Mark

Sorting and counting


$ wc finn
12361 114266 610157 finn

$ < words grep '^a' | grep 'e$' | sort | uniq -c | sort -rn
77 are
21 alone
20 ashore
19 above
13 alive
9 awhile
9 apiece
7 axe
7 agree
5 anywhere
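
The same extract, filter, and count steps can be sketched in Python with `collections.Counter`. The sentence in `text` is invented for the example (it is not from the book), and the ordering of words with equal counts may differ from what `sort -rn` prints.

```python
import re
from collections import Counter

text = "The axe are alone. We are ashore, alone and alive; are we?"

# Same idea as grep -oE '\w+': pull out runs of word characters.
words = re.findall(r"\w+", text)

# Same idea as grep '^a' | grep 'e$': starts with "a" and ends with "e".
selected = [w for w in words if w.startswith("a") and w.endswith("e")]

# Same idea as sort | uniq -c | sort -rn: count, then order by count.
counts = Counter(selected).most_common()
for word, n in counts:
    print(f"{n:7d} {word}")
```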

Replacing data
$ < finn tr '[a-z]' '[A-Z]' > /dev/null
$ < finn tr '[:lower:]' '[:upper:]' | head -n 14

THE PROJECT GUTENBERG EBOOK OF ADVENTURES OF HUCKLEBERRY FINN,


BY MARK TWAIN (SAMUEL CLEMENS)

THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE AT NO COST AND WIT
NO RESTRICTIONS WHATSOEVER. YOU MAY COPY IT, GIVE IT AWAY OR RE
IT UNDER THE TERMS OF THE PROJECT GUTENBERG LICENSE INCLUDED WI
EBOOK OR ONLINE AT WWW.GUTENBERG.NET

TITLE: ADVENTURES OF HUCKLEBERRY FINN, COMPLETE

AUTHOR: MARK TWAIN (SAMUEL CLEMENS)



Replacing data
$ < finn sed 's/ /_/g' | head -n 14

The_Project_Gutenberg_EBook_of_Adventures_of_Huckleberry_Finn,_
by_Mark_Twain_(Samuel_Clemens)

This_eBook_is_for_the_use_of_anyone_anywhere_at_no_cost_and_wit
no_restrictions_whatsoever._You_may_copy_it,_give_it_away_or_re
it_under_the_terms_of_the_Project_Gutenberg_License_included_wi
eBook_or_online_at_www.gutenberg.net

Title:_Adventures_of_Huckleberry_Finn,_Complete

Author:_Mark_Twain_(Samuel_Clemens)


Summing values
$ < tips.csv tail -n +2 | cut -d, -f1 | paste -s -d+
16.99+10.34+21.01+23.68+24.59+25.29+8.77+26.88+15.04+14.78+
10.27+35.26+15.42+18.43+14.83+21.58+10.33+16.29+16.97+20.65
+17.92+20.29+15.77+39.42+19.82+17.81+13.37+12.69+21.7+19.65
+9.55+18.35+15.06+20.69+17.78+24.06+16.31+16.93+18.69+ ...

$ < tips.csv tail -n +2 | cut -d, -f1 | paste -s -d+ | bc
4827.77

$ < tips.csv awk -F, '{ sum+=$1} END {print sum}'
4827.77

$ < tips.csv Rio -e 'sum(df$bill)'
[1] 4827.77
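
A fourth way to get the same kind of total, sketched in Python under the assumption that the first column is named `bill` as in the listing above. Only the first three data rows from the slides are used here, so the printed total is for those rows and not for the whole file.

```python
import csv
import io
from decimal import Decimal

sample = """bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
"""

reader = csv.DictReader(io.StringIO(sample))
# Decimal avoids binary floating-point rounding when adding currency amounts.
total = sum(Decimal(row["bill"]) for row in reader)
print(total)
```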

Example: Web Scraping


Extracting data from HTML


Download HTML using curl


$ curl -s 'http://en.wikipedia.org/wiki/List_of_countries_an
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>List of countries and territo
<meta name="generator" content="MediaWiki 1.23wmf10" />
<link rel="alternate" type="application/x-wiki" title="Edit
<link rel="edit" title="Edit this page" href="/w/index.php?t
<link rel="apple-touch-icon" href="//bits.wikimedia.org/appl
<link rel="shortcut icon" href="//bits.wikimedia.org/favicon
<link rel="search" type="application/opensearchdescription+x
<link rel="EditURI" type="application/rsd+xml" href="//en.wi
<link rel="copyright" href="//creativecommons.org/licenses/b


Scrape element with CSS selectors


$ < wiki.html scrape -b -e 'table.wikitable > tr:not(:first-child)'
<!DOCTYPE html>
<html>
<body>
<tr>
<td>1</td>
<td>Vatican City</td>
<td>3.2</td>
<td>0.44</td>
<td>7.2727273</td>
</tr>


Convert to JSON using xml2json


$ < table.html xml2json | jq '.'
{
"html": {
"body": {
"tr": [
{
"td": [
{
"$t": "1"
},
{
"$t": "Vatican City"
},


Transform JSON using jq


$ < table.json jq -c '.html.body.tr[] | {country: .td[1][],
> border: .td[2][], surface: .td[3][], ratio: .td[4][]}'
{"ratio":"7.2727273","surface":"0.44","border":"3.2","countr
{"ratio":"2.2000000","surface":"2","border":"4.4","country":
{"ratio":"0.6393443","surface":"61","border":"39","country":
{"ratio":"0.4750000","surface":"160","border":"76","country"
{"ratio":"0.3000000","surface":"34","border":"10.2","country
{"ratio":"0.2570513","surface":"468","border":"120.3","count
{"ratio":"0.2000000","surface":"6","border":"1.2","country":
{"ratio":"0.1888889","surface":"54","border":"10.2","country
{"ratio":"0.1388244","surface":"2586","border":"359","countr
{"ratio":"0.0749196","surface":"6220","border":"466","countr
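
To make the reshaping step easier to follow, here is a minimal Python sketch of what the `jq` filter does to each table row. The input mirrors the structure shown on the xml2json slide, with the Vatican City values from the scraped row; the helper name `to_record` is hypothetical.

```python
import json

# One row in the shape shown on the xml2json slide (values from the slides).
doc = json.loads("""
{"html": {"body": {"tr": [
  {"td": [{"$t": "1"}, {"$t": "Vatican City"}, {"$t": "3.2"},
          {"$t": "0.44"}, {"$t": "7.2727273"}]}
]}}}
""")


def to_record(tr):
    """Flatten one table row into a dict, like the jq object constructor."""
    cells = [td["$t"] for td in tr["td"]]
    return {"country": cells[1], "border": cells[2],
            "surface": cells[3], "ratio": cells[4]}


records = [to_record(tr) for tr in doc["html"]["body"]["tr"]]
for record in records:
    print(json.dumps(record))
```

Each output line is one self-contained JSON object, which is the shape a converter such as `json2csv` can then turn into rows.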


Convert to CSV with json2csv


$ < countries.json json2csv -p -k border,surface | csvlook
|----------+-----------|
| border | surface |
|----------+-----------|
| 3.2 | 0.44 |
| 4.4 | 2 |
| 39 | 61 |
| 76 | 160 |
| 10.2 | 34 |
| 120.3 | 468 |
| 1.2 | 6 |
| 10.2 | 54 |
| 359 | 2586 |
| 466 | 6220 |

Behold, the beast

$ curl -s 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' |
> scrape -be 'table.wikitable > tr:not(:first-child)' |
> xml2json | jq -c '.html.body.tr[] | {country: .td[1][],
> border: .td[2][], surface: .td[3][], ratio: .td[4][]}' |
> json2csv -p -k=border,surface | csvlook


Exploration


Statistics at the command line


$ < tips.csv tail -n +2 | cut -d, -f2 | qstats
Min. 1
1st Qu. 2
Median 2.9
Mean 2.99828
3rd Qu. 3.575
Max. 10
Range 9
Std Dev. 1.3808
Length 244

$ < tips.csv tail -n +2 | cut -d, -f2 | qstats -m
2.99828
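
A similar summary can be sketched with Python's standard `statistics` module. The values are the first five tips from the tips.csv listing, so the numbers will not match the full-file output above. Which standard deviation variant `qstats` reports is not stated on the slide; the population form (`pstdev`) is used here as an assumption.

```python
import statistics

# The first five tip values from the tips.csv listing on the slides.
tips = [1.01, 1.66, 3.5, 3.31, 3.61]

summary = {
    "Min.": min(tips),
    "Median": statistics.median(tips),
    "Mean": statistics.mean(tips),
    "Max.": max(tips),
    "Range": max(tips) - min(tips),
    # Population standard deviation; an assumption, see the note above.
    "Std Dev.": statistics.pstdev(tips),
    "Length": len(tips),
}

for name, value in summary.items():
    print(f"{name:<9} {value:g}")
```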


Statistics at the command line


$ < tips.csv tail -n +2 | cut -d, -f2 | histogram.py -b10
NumSamples = 244; Min = 1.00; Max = 10.00
Mean = 2.998279; Variance = 1.906609; SD = 1.380800
each * represents a count of 1
1.0000 - 1.9000 [41]: ************************************
1.9000 - 2.8000 [79]: ************************************
2.8000 - 3.7000 [66]: ************************************
3.7000 - 4.6000 [27]: ***************************
4.6000 - 5.5000 [19]: *******************
5.5000 - 6.4000 [ 5]: *****
6.4000 - 7.3000 [ 4]: ****
7.3000 - 8.2000 [ 1]: *
8.2000 - 9.1000 [ 1]: *
9.1000 - 10.0000 [ 1]: *
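
The bucketing behind such a text histogram can be sketched in a few lines of Python. This is an illustration of equal-width binning, not the code of `histogram.py`; the function name `histogram`, the sample values, and the rule that the maximum lands in the last bucket are assumptions of this sketch.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width buckets between min and max."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 if all values are identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bucket instead of a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


tips = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 5.0]
for start, end, count in histogram(tips, 4):
    print(f"{start:8.4f} - {end:8.4f} [{count:2d}]: {'*' * count}")
```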

Rio: Making R part of the pipeline

$ < tips.csv Rio -se 'sqldf("select time,count(*) from
> df group by time;")'
time,count(*)
Dinner,176
Lunch,68

$ < tips.csv csvcut -c time | tail -n +2 | sort | uniq -c
176 Dinner
68 Lunch


ggplot at the command line


$ < tips.csv Rio -ge 'g+geom_point(aes(bill,tip,
> colour=sex))+facet_wrap(~ time)' | display


Data Science Toolbox


Motivation

- Writing Data Science at the Command Line


- Isolated environment for executing code
- Share environment with readers
- Shell script to install command-line tools
- Turn shell script into more generic solution


Data Science Toolbox 0.1.5

- Virtual environment for data science


- Locally and in the cloud
- Open source (BSD license)
- http://datasciencetoolbox.org
- @DataSciToolbox


Standing on the shoulders of giants


Sensible base

Data Science Toolbox currently contains:


- Python scientific stack
- R
- dst command-line tool


Software and data bundles

Collection of software and/or data related to:


- Book
- Course
- Organization


Locally or in the cloud?


- Locally
  - Need to share resources
  - No internet connection needed
  - Completely free
- In the cloud
  - Larger machines possible
  - Probably not free
  - Long-running experiments

Getting Started
(See also http://datasciencetoolbox.org)


Download and install VirtualBox and Vagrant

- https://www.virtualbox.org/wiki/Downloads
- http://www.vagrantup.com/downloads.html


Download and start the Data Science Toolbox

Create directory:
$ mkdir MyDataScienceToolbox
$ cd MyDataScienceToolbox

Download and start:


$ vagrant init data-science-toolbox/dst
$ vagrant up


Log in
On Mac OS X and Linux:
$ vagrant ssh

On Microsoft Windows:
- Download putty.exe
- Enter:
  - Host Name (or IP address): 127.0.0.1
  - Port: 2222
  - Connection type: SSH
- Username and password: vagrant

Install additional software and bundles

Ubuntu and Python packages:


vagrant@data-science-toolbox:~$ sudo apt-get install cowsay
vagrant@data-science-toolbox:~$ sudo pip install networkx

R packages:
vagrant@data-science-toolbox:~$ R
> install.packages('stringr')

Bundles:
vagrant@data-science-toolbox:~$ dst add dsatcl


Building your own Data Science Toolbox


Optimizing your environment

- Terminal, shell, and prompt


- Aliases, functions, and scripts
- Shortcuts


Custom terminal, shell, and prompt


Aliases
alias l '/bin/ls -ltrFsA'
alias mi 'mv -i'
alias up "cd .."
alias fox "open -a 'Firefox' \!:*"

# spelling while typing is hard
alias alais alias
alias moer more
alias mroe more
alias pu up

#alias onion 'open http://www.theonion.com/content/index'
alias onion echo "back to work"


Shortcuts

$ cd ~/some/very/deep/often-used/directory
$ mark deep

$ jump deep

$ unmark deep

$ marks
deep -> /home/jeroen/some/very/deep/often-used/directory
foo -> /usr/bin/foo/bar


Shortcuts
export MARKPATH=$HOME/.marks
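function mark {
    # Create a symlink named $1 that points at the current directory.
    mkdir -p "$MARKPATH"; ln -s "$(pwd)" "$MARKPATH/$1"
}
function jump {
    # Follow the symlink; -P resolves it to the physical directory.
    cd -P "$MARKPATH/$1" 2>/dev/null ||
        echo "No such mark: $1"
}
function unmark {
    rm -i "$MARKPATH/$1"
}
function marks {
    # List all marks as "name -> target"; collapse double spaces in the
    # ls output so that cut can pick out field 9 onwards.
    ls -l "$MARKPATH" | sed 's/  / /g' |
        cut -d' ' -f9- | sed 's/ -/\t-/g' && echo
}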

From one-liners to reusable tools

- Shebang: #!/usr/bin/env bash


- Permission: chmod +x
- Arguments: $1, $2, $@
- Exit codes: 0, 1, 2
- Extension is not important
- Add to PATH
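
The same checklist applies to scripts in other languages. The sketch below is a hypothetical Python tool called `head_n` (not from the talk) that shows the shebang line, positional arguments, and exit codes; the choice of exit code 2 for a usage error is a convention assumed here. The streams are passed in as parameters only so that the example can run without a terminal.

```python
#!/usr/bin/env python3
"""head_n: print the first N lines of standard input (hypothetical tool)."""
import io
import sys


def main(argv, stdin=None, stdout=None, stderr=None):
    """Run the tool and return its exit code (0 = success, 2 = usage error)."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    stderr = sys.stderr if stderr is None else stderr
    if len(argv) != 1 or not argv[0].isdigit():
        print("usage: head_n N", file=stderr)
        return 2
    limit = int(argv[0])
    for number, line in enumerate(stdin):
        if number >= limit:
            break
        stdout.write(line)
    return 0


# Simulate `printf 'a\nb\nc\n' | head_n 2` without touching the real streams.
out = io.StringIO()
exit_code = main(["2"], stdin=io.StringIO("a\nb\nc\n"), stdout=out)
print(exit_code, repr(out.getvalue()))
# In an installed script the last line would instead be:
#     sys.exit(main(sys.argv[1:]))
```

Saved without an extension, made executable with `chmod +x`, and placed in a directory on the PATH, such a file can be called like any other command-line tool.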


Example: CLI for explainshell.com


#!/usr/bin/env bash
# explain: Command-line wrapper for explainshell.com
#
# Example usage: explain tar xzvf
# Dependency: scrape
# Author: http://jeroenjanssens.com

COMMAND="$@"
URL="http://explainshell.com/explain?cmd=${COMMAND}"
curl -s "${URL}" |
scrape -e 'span.dropdown > a, pre' |
sed -re 's/<(\/?)[^>]*>//g'


Example: CLI for explainshell.com


$ explain tar xzvf
The GNU version of the tar archiving utility

-x, --extract, --get
extract files from an archive

-z, --gzip, --gunzip --ungzip

-v, --verbose
verbosely list files processed

-f, --file ARCHIVE
use archive file or device ARCHIVE


Command-line tools from existing code

- Accept standard input


- Write to standard output / error
- Parse command-line arguments
- Provide help
- Take Unix philosophy into account
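
One possible way to cover these points with only the standard library is `argparse`, which generates `-h/--help` automatically and reports usage errors on standard error. The tool name `upper`, the `-n/--max-lines` option, and the helper names below are all invented for this sketch; the streams are passed in explicitly so that the example is easy to try out.

```python
import argparse
import io


def build_parser():
    """Describe the command-line interface; argparse adds -h/--help itself."""
    parser = argparse.ArgumentParser(
        prog="upper", description="Upper-case text from standard input.")
    parser.add_argument("-n", "--max-lines", type=int, default=None,
                        help="stop after this many lines")
    return parser


def run(argv, stdin, stdout):
    """Read lines from stdin, write upper-cased lines to stdout."""
    args = build_parser().parse_args(argv)
    for number, line in enumerate(stdin):
        if args.max_lines is not None and number >= args.max_lines:
            break
        stdout.write(line.upper())
    return 0


# Simulate `printf 'hello\nworld\n' | upper -n 1`.
out = io.StringIO()
run(["-n", "1"], io.StringIO("hello\nworld\n"), out)
print(out.getvalue(), end="")
```

In a real script, `run(sys.argv[1:], sys.stdin, sys.stdout)` would connect the same function to the actual streams, so the tool can sit in the middle of a pipeline.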


Parsing command-line arguments with docopt


#!/usr/bin/env python
"""Usage: pycho [-hnv] [STRING ...]

-h --help     Show this screen.
-n            Do not output trailing newline.
-v --version  Show version.
"""
from docopt import docopt
from sys import stdout

if __name__ == "__main__":
    # docopt builds the argument parser from the usage text above.
    args = docopt(__doc__, version="Pycho 1.0")
    stdout.write(" ".join(args["STRING"]))
    if not args["-n"]:
        stdout.write("\n")

Parsing command-line arguments with docopt


$ pycho -h
Usage: pycho [-hnv] [STRING ...]

-h --help     Show this screen.
-n            Do not output trailing newline.
-v --version  Show version.

$ pycho --version
Pycho 1.0

$ pycho -n COMMAND LINE REPRESENT
COMMAND LINE REPRESENT%

$

Conclusion

- Data Science Toolbox lets you start doing data science in minutes
- Command line is great for doing data science
- Does not solve all your problems
- OK to continue with R / IPython / ...


Where to go from here?

- Install Data Science Toolbox


- Do a tutorial
- Practice your one-liners
- Give (feed)back


References

- http://datasciencetoolbox.org
- http://cli.learncodethehardway.org/book/
- https://github.com/tonyfischetti/qstats
- https://github.com/jehiah/json2csv
- https://github.com/bitly/data_hacks
- https://github.com/chrishwiggins/mise
- http://csvkit.readthedocs.org/en/latest/
- http://stedolan.github.io/jq/


Thank you!

http://jeroenjanssens.com
@jeroenhjanssens
