Data Integration with Lumify and Datawrapper
• Lumify is an open-source toolkit developed by Altamira that helps manage big data.
• Eric Schmidt, executive chairman of Google, was quoted as saying that "the biggest disrupter (of
2014) that we're sure about is the arrival of big data and machine intelligence everywhere."
• Lumify addresses the concerns of big data by providing a platform for
• data integration,
• analytics,
• visualization.
• Key features include
• full-text search,
• 2D and 3D graph visualizations,
• link analysis between graph entities,
• geospatial analysis via integration with any OpenLayers-compatible mapping system,
• multimedia analysis for text, images, and video,
• collaboration features.
Datawrapper
• Datawrapper is a data visualization tool that creates charts and maps to help tell your story better.
• Datawrapper was designed for journalists and can be used on mobile or desktop devices during the
design phase.
• Features
• Datawrapper provides common charts and maps such as:
• Bar Charts
• Dot, Arrow, and Scatter Plots
• Column Charts
• Line Graphs
• Donut/Pie Charts
• Tables
• Symbol or Choropleth Maps
• Datawrapper is easy to use and requires no coding knowledge or design background.
• Datawrapper makes it easy to:
1. Copy and paste your data: data can be pasted directly or uploaded as CSV files exported from popular tools such as Excel
and Google Sheets.
2. Visualize: select from multiple chart and map types and insert annotations.
3. Publish: Datawrapper provides an embed code to use on your website, or lets you export your chart
or map as an image file or a PDF.
Sqoop
Introduction
Installation
Import and export
Introduction
• The traditional application management system, that is, the interaction of applications with relational databases using
an RDBMS, is one of the sources that generate Big Data.
• Such Big Data is stored in relational database servers in the relational database structure.
• When Big Data storage and analysis tools such as
• MapReduce,
• Hive,
• HBase,
• Cassandra,
• Pig, etc.
of the Hadoop ecosystem came into the picture, they required a tool to interact with relational database servers to import and export
the Big Data residing in them.
• Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational database server and
Hadoop’s HDFS.
• Sqoop − “SQL to Hadoop and Hadoop to SQL”
• Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
• It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file
system back to relational databases.
• It is provided by the Apache Software Foundation.
Key Features of Sqoop
• Sqoop provides many salient features, such as the following; a combined command-line example is sketched after this list.
• Full Load: Apache Sqoop can load a whole table with a single command. You can also load all the tables from a database using a single
command.
• Incremental Load: Apache Sqoop also provides the facility of incremental load, where you can load parts of a table whenever it is updated.
• Parallel import/export: Sqoop uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
• Import results of SQL query: You can also import the result returned by an SQL query into HDFS.
• Compression: You can compress your data using the deflate (gzip) algorithm with the --compress argument, or by specifying the --compression-codec
argument. You can also load a compressed table into Apache Hive.
• Connectors for all major RDBMS databases: Apache Sqoop provides connectors for multiple RDBMS databases, covering almost all of the
commonly used relational databases.
• Kerberos Security Integration: Kerberos is a computer network authentication protocol that works on the basis of ‘tickets’ to allow nodes
communicating over a non-secure network to prove their identity to one another in a secure manner. Sqoop supports Kerberos authentication.
• Load data directly into Hive/HBase: You can load data directly into Apache Hive for analysis and also dump your data into HBase, which is
a NoSQL database.
• Support for Accumulo: You can also instruct Sqoop to import a table into Accumulo rather than a directory in HDFS.
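To make a few of these features concrete, the following is a minimal sketch that combines a free-form query import with compression and a direct Hive load. The connection string, credentials, and column names are borrowed from the userdb/emp example used later in this section; the target directory and Hive table name are hypothetical.
# Hedged sketch: import the result of an SQL query, gzip-compressed, directly into a Hive table.
# $CONDITIONS is a placeholder that Sqoop substitutes so the query can be split across parallel mappers.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--query 'SELECT id, name, salary FROM emp WHERE $CONDITIONS' \
--split-by id \
--target-dir /user/hadoop/emp_query \
--compress \
--compression-codec org.apache.hadoop.io.compress.GzipCodec \
--hive-import \
--hive-table emp_query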
How Sqoop Works
(Figure: Sqoop architecture and working — the diagram illustrates the Sqoop import/export workflow between an RDBMS and HDFS.)
Import
• The import tool imports individual tables from RDBMS to HDFS.
• Each row in a table is treated as a record in HDFS.
• When we submit a Sqoop command, the main task gets divided into
subtasks, each of which is handled by an individual map task internally.
• A map task is a subtask that imports a part of the data into the Hadoop
ecosystem.
• Collectively, all map tasks import the whole dataset.
Export
• Export also works in a similar manner.
• The export tool exports a set of files from HDFS back to an RDBMS.
• The files given as input to Sqoop contain records, which are called rows in the table.
• When we submit our job, it is mapped into map tasks, which bring chunks of data from HDFS.
• These chunks are exported to a structured data destination.
• Combining all these exported chunks of data, we receive the whole data at the destination, which in most
cases is an RDBMS (MySQL/Oracle/SQL Server).
• A reduce phase is required only in the case of aggregations.
• But Apache Sqoop just imports and exports the data;
• it does not perform any aggregations.
• The map job launches multiple mappers, depending on the number defined by the user.
• For a Sqoop import, each mapper task is assigned a part of the data to be imported.
• Sqoop distributes the input data among the mappers equally to get high performance.
• Each mapper then creates a connection with the database using JDBC, fetches the part of the data assigned
to it by Sqoop, and writes it into HDFS, Hive, or HBase based on the arguments provided on the CLI (see the sketch below).
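As a hedged illustration of this parallel import (the flags are standard Sqoop options, but the split column, mapper count, and target directory are assumptions), the command below asks for four mappers and tells Sqoop to split the rows on the id column so that each mapper fetches its own range over JDBC.
# Hedged sketch: run the import with 4 parallel mappers, splitting the rows on the
# primary-key column id so each mapper fetches and writes its own slice of the table.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--split-by id \
--num-mappers 4 \
--target-dir /user/hadoop/emp_parallel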
Sqoop Import and Export
• Sqoop Import
• The import tool imports individual tables from RDBMS to HDFS.
• Each row in a table is treated as a record in HDFS.
• All records are stored as text data in text files or as binary data in Avro and Sequence
files (a format-selection sketch follows this list).
• Sqoop Export
• The export tool exports a set of files from HDFS back to an RDBMS.
• The files given as input to Sqoop contain records, which are called rows in the table.
• These are read and parsed into a set of records, delimited with a user-specified delimiter.
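As a sketch of the storage-format choice mentioned above, the command below stores the imported records as a binary Avro data file instead of the default delimited text; --as-avrodatafile (and --as-sequencefile for SequenceFiles) are standard Sqoop flags, while the target directory is an assumption.
# Hedged sketch: import the emp table as an Avro data file (binary) rather than
# the default comma-delimited text; swap in --as-sequencefile for SequenceFile output.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--as-avrodatafile \
--target-dir /user/hadoop/emp_avro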
Example
Let us take an example of three tables named emp, emp_add, and emp_contact, which are in a database called
userdb on a MySQL database server.
The three tables and their data are as follows.
emp:
id name deg salary dept
1201 gopal manager 50,000 TP
1202 manisha Proof reader 50,000 TP
1203 khalil php dev 30,000 AC
1204 prasanth php dev 30,000 AC
1205 kranthi admin 20,000 TP
emp_add:
id hno street city
1201 288A vgiri jublee
1202 108I aoc sec-bad
1203 144Z pgutta hyd
1204 78B old city sec-bad
1205 720X hitec sec-bad
emp_contact:
id phno email
1201 2356742 [email protected]
1202 1661663 [email protected]
1203 8887776 [email protected]
1204 9988774 [email protected]
1205 1231231 [email protected]
Import Tool
Syntax
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
The following command is used to import the emp table from MySQL
database server to HDFS.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1
Here --m 1 specifies that a single mapper is used for the import. The table is downloaded into the default directory in HDFS. The default number of
mappers is 4.
You can change this by appending "-m <number_of_mappers>" to the command.
To verify the imported data in HDFS, use the following command.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
It shows you the emp table data with comma (,) separated fields.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
Importing into Target Directory
We can specify the target directory while importing table data into HDFS
using the Sqoop import tool.
The following is the syntax to specify the target directory as an option to the Sqoop import
command.
--target-dir <new or exist directory in HDFS>
The following command is used to import emp_add table data into
‘/queryresult’ directory.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult
Import
• The following command is used to verify the imported data in the
/queryresult directory from the emp_add table.
• $ $HADOOP_HOME/bin/hadoop fs -cat /queryresult/part-m-*
• It will show you the emp_add table data with comma (,) separated
fields.
1201, 288A, vgiri, jublee
1202, 108I, aoc, sec-bad
1203, 144Z, pgutta, hyd
1204, 78B, oldcity, sec-bad
1205, 720X, hitec, sec-bad
Import Subset of Table Data
We can import a subset of a table using the ‘where’ clause in Sqoop import tool.
It executes the corresponding SQL query in the respective database server and stores the result in a target directory
in HDFS.
The syntax for where clause is as follows.
--where <condition>
Example: The following command is used to import a subset of emp_add table data.
The subset query retrieves the employee id and address of employees who live in Secunderabad (sec-bad) city.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where "city ='sec-bad'" \
--target-dir /wherequery
• The following command is used to verify the imported data in
/wherequery directory from the emp_add table.
• $ $HADOOP_HOME/bin/hadoop fs -cat /wherequery/part-m-*
• It will show you the emp_add table data with comma (,) separated
fields.
1202, 108I, aoc, sec-bad
1204, 78B, oldcity, sec-bad
1205, 720X, hitec, sec-bad
Incremental Import
• Incremental import is a technique that imports only the newly added rows
in a table.
• It is required to add ‘incremental’, ‘check-column’, and ‘last-value’
options to perform the incremental import.
• The following syntax is used for the incremental option in Sqoop import
command.
--incremental <mode>
--check-column <column name>
--last-value <last check column value>
Example
Let us assume the newly added data into emp table is as follows −
1206, satish p, grp des, 20000, GR
The following command is used to perform the incremental import in the emp table.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205
The following command is used to verify the imported data from emp table to HDFS emp/
directory.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*
It shows you the emp table data with comma (,) separated fields.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR
The following command is used to see the modified or newly added rows from the emp table.
$ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*1
It shows you the newly added rows to the emp table with comma (,) separated fields.
1206, satish p, grp des, 20000, GR
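For repeated incremental imports, Sqoop's saved-job facility can track the last imported value between runs, so --last-value does not have to be updated by hand. The sketch below is a hedged illustration against the same emp table; the job name emp_incremental is arbitrary.
# Hedged sketch: save the incremental import as a Sqoop job; the job metastore
# remembers the last value of the check column between runs.
$ sqoop job --create emp_incremental -- import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--incremental append \
--check-column id \
--last-value 1205
# Run (and re-run) the saved job; each execution imports only rows added since the last run.
$ sqoop job --exec emp_incremental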
Import all tables
Syntax
$ sqoop import-all-tables (generic-args) (import-args)
Example
Let us take an example of importing all tables from the userdb database.
The list of tables that the database userdb contains is as follows.
+--------------------+
| Tables             |
+--------------------+
| emp                |
| emp_add            |
| emp_contact        |
+--------------------+
The following command is used to import all the tables from the userdb database.
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root
The following command is used to verify that all the table data from the userdb database has been imported into HDFS.
$ $HADOOP_HOME/bin/hadoop fs -ls
Output
drwxr-xr-x - hadoop supergroup 0 2014-12-22 22:50 _sqoop
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:46 emp
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:50 emp_add
drwxr-xr-x - hadoop supergroup 0 2014-12-23 01:52 emp_contact
Export
The following is the syntax for the export command.
$ sqoop export (generic-args) (export-args)
Example
Let us take an example of employee data in a file in HDFS.
The employee data is available in the emp_data file in the ‘emp/’ directory in HDFS.
The emp_data file is as follows.
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
1206, satish p, grp des, 20000, GR
It is mandatory that the table into which the data is to be exported is created manually and is present in the
target database.
The following query is used to create the table ‘employee’ on the MySQL
command line.
$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));
The following command is used to export the table data (which is in the
emp_data file on HDFS) to the employee table in the db database of the MySQL
database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
The following command is used to verify the table on the MySQL command line.
mysql> SELECT * FROM employee;
If the given data is stored successfully, then you can find the following table of given employee
data.
+------+----------+-------------+--------+------+
| Id   | Name     | Designation | Salary | Dept |
+------+----------+-------------+--------+------+
| 1201 | gopal    | manager     | 50000  | TP   |
| 1202 | manisha  | preader     | 50000  | TP   |
| 1203 | kalil    | php dev     | 30000  | AC   |
| 1204 | prasanth | php dev     | 30000  | AC   |
| 1205 | kranthi  | admin       | 20000  | TP   |
| 1206 | satish p | grp des     | 20000  | GR   |
+------+----------+-------------+--------+------+