
Data Analytics (BCS-052)

Unit 5
Frame Works and Visualization
Syllabus

Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive, HBase,


MapR, Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems,
Visualization: visual data analysis techniques, interaction techniques, systems
and applications.

Introduction to R - R graphical user interfaces, data import and export,


attribute and data types, descriptive statistics, exploratory data analysis,
visualization before analysis, analytics for unstructured data
Frame Works
• A framework is like a structure that provides a base for the application
development process.

• Frameworks provide a set of tools and elements that help in the speedy
development process. It acts like a template that can be used and even
modified to meet the project requirements.

• Frameworks are based on programming languages. Some popular, widely used frameworks are Django, Flutter, Angular, Vue, PyTorch, Spring Boot, React Native, Apache Spark, Ionic, etc. These frameworks allow developers to create robust software with rich functionality.
Visualization

• Visualization is the process of creating graphical representations of data or


information to make it easier to understand, analyze, and communicate.

• It involves transforming complex data sets into visuals like charts, graphs,
maps, or even interactive dashboards, allowing people to spot trends,
patterns, and insights quickly.
MapReduce

MapReduce is a programming model and processing technique for handling


and analyzing large data sets in a distributed computing environment. It was
introduced by Google to efficiently process vast amounts of data across multiple
servers in parallel. MapReduce breaks down data processing into two main
functions:

1. Map: The input data is divided into smaller, manageable chunks, which are processed in parallel. In this step, each chunk is analysed or transformed, and key-value pairs are generated (in a word count, for example, each word is emitted with the value 1).
MapReduce (Contd…)

2. Reduce: The output from the map step is then grouped by key and processed to combine the values. This step summarises the data according to the specific problem requirements. For instance, in the word-count example, the reduce function aggregates the counts for each unique word across all chunks, producing a total count per word.

• MapReduce is highly scalable and works well with distributed storage systems
like Hadoop Distributed File System (HDFS). It’s widely used in big data
applications where data sets are too large to fit on a single machine, such as in
search engines, recommendation systems, and large-scale analytics.
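The word-count flow can be sketched on a single machine with base R's Map() function, with a grouped sum standing in for the reduce phase; the input lines below are made up, and in real MapReduce these steps run in parallel across a cluster.

# Map step: split every input chunk (here, a line of text) into words,
# conceptually emitting a (word, 1) pair for each word.
lines <- c("big data is big", "data analytics with big data")
words <- unlist(Map(function(line) strsplit(line, " ")[[1]], lines))
pairs <- data.frame(key = words, value = 1)

# Shuffle + Reduce step: group the pairs by key and sum the values,
# giving the total count for each unique word.
word_counts <- tapply(pairs$value, pairs$key, sum)
print(word_counts)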
MapReduce Architecture
Components of MapReduce Architecture

1. Client: The MapReduce client is the entity that submits a job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Manager.

2. Job: The MapReduce job is the actual work the client wants done; it is composed of many smaller tasks that the client wants to process or execute.

3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
Components of MapReduce Architecture (Contd…)

4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.

5. Input Data: The data set that is fed to the MapReduce for processing.

6. Output Data: The final result is obtained after the processing.


Hadoop

• Hadoop is an open-source software programming framework for storing large amounts of data and performing computation. Its framework is based on Java programming with some native code in C and shell scripts.

• Hadoop is an open-source software framework that is used for storing and


processing large amounts of data in a distributed computing environment. It is
designed to handle big data and is based on the MapReduce programming
model, which allows for the parallel processing of large datasets.
Hadoop (Contd…)

Hadoop has two main components:

1. HDFS (Hadoop Distributed File System): This is the storage component of


Hadoop, which allows for the storage of large amounts of data across multiple
machines. It is designed to work with commodity hardware, which makes it cost-
effective.

2. YARN (Yet Another Resource Negotiator): This is the resource management


component of Hadoop, which manages the allocation of resources (such as CPU
and memory) for processing the data stored in HDFS.
Hadoop Architecture

• Hadoop is a framework written in Java that utilises a large cluster of commodity hardware to maintain and store very large amounts of data.

• Hadoop works on the MapReduce programming model that was introduced by Google. Today many big-brand companies use Hadoop in their organisations to deal with Big Data, e.g., Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of four components: HDFS, YARN, MapReduce, and Hadoop Common.
Hadoop Architecture (Contd…)
Features of Hadoop

1. It is fault tolerant.

2. It is highly available.

3. Its programming is easy.

4. It has huge, flexible storage.

5. It is low cost.
Advantages of Hadoop

• Ability to store a large amount of data.

• High flexibility.

• Cost effective.

• High computational power.

• Tasks are independent.

• Linear scaling.
Disadvantages of Hadoop

• Not very effective for small data.

• Hard cluster management.

• Has stability issues.

• Security concerns.
Pig

• Pig represents Big Data as data flows. Pig is a high-level platform or tool which is used to process large datasets.

• It provides a high level of abstraction over MapReduce. It provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.

• First, to process the data stored in HDFS, programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks. These conversions are not visible to the programmers, in order to provide a high level of abstraction.

• Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The results of Pig are always stored in HDFS.
Need of Pig

• It uses a query approach, which reduces the length of the code.

• Pig Latin is an SQL-like language.

• It provides many built-in operators.

• It provides nested data types (tuples, bags, maps).


Features of Pig
• Apache Pig provides a rich set of operators for performing operations such as filtering, joining, sorting, and aggregation.

• Easy to learn, read and write. Especially for SQL programmers, Apache Pig is a boon.

• Join operations are easy in Apache Pig.

• Fewer lines of code.

• Apache Pig allows splits in the pipeline.

• The data structure is multivalued, nested, and richer.

• Pig can handle the analysis of both structured and unstructured data.
Applications of Pig

• Pig scripting is used for exploring large datasets.

• Provides support for ad-hoc queries across large datasets.

• For prototyping algorithms that process large datasets.

• For processing time-sensitive data loads.

• For collecting large datasets such as search logs and web crawls.

• Used where analytical insights are needed using sampling.
Apache Pig Architecture

• The language used to analyse data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

• To perform a particular task, programmers write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms. After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.

• Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy. The architecture of Apache Pig is shown below:
Apache Pig Architecture (Contd…)
Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework as discussed
below:

• Parser: Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as nodes and the data flows are represented as edges.

• Optimiser: The logical plan (DAG) is passed to the logical optimiser, which carries out logical optimisations such as projection pushdown.

• Compiler: The compiler compiles the optimised logical plan into a series of MapReduce jobs.
Apache Pig Components (Contd…)

• Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are executed to produce the desired results.

• Pig Latin Data Model: The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple.

• Atom: Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Apache Pig Components (Contd…)

• Tuple: A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is like a row in a table of an RDBMS. Example: (Raja, 30)

• Bag: A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by { }. It is like a table in an RDBMS but, unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. Example: {(Raja, 30), (Mohammad, 45)}

• Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. It is represented by [ ]. Example: [name#Raja, age#30]
Difference between Pig and MapReduce
• Apache Pig is a scripting language; MapReduce is a compiled programming language.

• In Apache Pig, abstraction is at a higher level; in MapReduce, abstraction is at a lower level.

• Apache Pig needs fewer lines of code; MapReduce needs more lines of code.

• Less development effort is needed for Apache Pig; more development effort is required for MapReduce.

• Code efficiency of Pig is lower; compared to Pig, the code efficiency of MapReduce is higher.

• Pig provides built-in functions for ordering, sorting and union; in MapReduce these data operations are hard to perform.

• Apache Pig allows nested data types like map, tuple and bag; MapReduce does not allow nested data types.
Difference between Pig and SQL

• Pig Latin is a procedural language; SQL is a declarative language.

• In Apache Pig, schema is optional: we can store data without designing a schema (values are referenced as $01, $02, etc.). In SQL, schema is mandatory.

• The data model in Apache Pig is nested relational; the data model used in SQL is flat relational.

• Apache Pig provides limited opportunity for query optimisation; there is more opportunity for query optimisation in SQL.
Hive

• Hive is a data warehouse infrastructure tool used to process structured data in Hadoop. It resides on top of Hadoop to summarise Big Data and makes querying and analysing easy.

• Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive (Contd…)

Hive is not

• A relational database

• A design for Online Transaction Processing (OLTP)

• A language for real-time queries and row-level updates


Features of Hive

• Hive is fast and scalable.

• It provides SQL-like queries (i.e., HQL) that are implicitly transformed to


MapReduce or Spark jobs.

• It can analyse large datasets stored in HDFS.

• It allows different storage types such as plain text, RCFile, and HBase.

• It uses indexing to accelerate queries.

• It can operate on compressed data stored in the Hadoop ecosystem.

• It supports user-defined functions (UDFs), so users can plug in their own functionality.
Limitations of Hive

• Hive is not capable of handling real-time data.

• It is not designed for online transaction processing.

• Hive queries have high latency.


Architecture of Hive

The following component diagram depicts the architecture of Hive.


HBase

• HBase is a data model similar to Google's Bigtable. It is an open-source, distributed database developed by the Apache Software Foundation and written in Java.

• HBase is an essential part of the Hadoop ecosystem. HBase runs on top of HDFS (Hadoop Distributed File System). It can store massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally scalable.

• HBase is a column-oriented non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse datasets, which are common in many big data use cases. It is well suited for real-time data processing or random read/write access to large volumes of data.
HBase (Contd…)

• It is an open-source project and is horizontally scalable.

• HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data.

• It leverages the fault tolerance provided by the Hadoop File System


(HDFS). It is a part of the Hadoop ecosystem that provides random real-
time read/write access to data in the Hadoop File System.
HBase (Contd…)
Features of HBase
• HBase is linearly scalable.

• It has automatic failure support.

• It provides consistent reads and writes.

• It integrates with Hadoop, both as a source and a destination.

• It has an easy Java API for clients.

• It provides data replication across clusters.


Where to Use HBase

• Apache HBase is used to have random, real-time read/write access to Big Data.

• It hosts very large tables on top of clusters of commodity hardware.

• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase

• It is used whenever there is a need for write-heavy applications.

• HBase is used whenever we need to provide fast random access to available data.

• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Storage Mechanism in HBase

HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which contain key-value pairs. A table has multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in HBase:

• Table is a collection of rows.

• Row is a collection of column families.

• Column family is a collection of columns.

• Column is a collection of key value pairs.
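The table → row → column family → column hierarchy can be pictured with nested R lists; this is only a mental model with made-up row keys and column names, not how HBase is actually accessed (that is done through its shell or Java API).

# Hypothetical sketch of the HBase storage model as nested R lists:
# table -> row key -> column family -> column -> (value, timestamp).
user_table <- list(
  "row-001" = list(                      # a row, identified by its row key
    personal = list(                     # column family "personal"
      name = list(value = "Raja",  timestamp = 1715000000),
      city = list(value = "Delhi", timestamp = 1715000000)
    ),
    professional = list(                 # column family "professional"
      role = list(value = "Analyst", timestamp = 1715000050)
    )
  )
)
# A cell lookup mirrors table : row key : column family : column addressing.
user_table[["row-001"]]$personal$name$value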


Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data. In short, they have column families.

• A row-oriented database is suitable for Online Transaction Processing (OLTP); a column-oriented database is suitable for Online Analytical Processing (OLAP).

• Row-oriented databases are designed for a small number of rows and columns; column-oriented databases are designed for huge tables.
Difference between HBase and HDFS
• HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.

• HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.

• HDFS provides high-latency batch processing; HBase provides low-latency access to single rows from billions of records (random access).

• HDFS provides only sequential access to data; HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
HBase and RDBMS

• HBase is schema-less: it does not have the concept of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.

• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.

• There are no transactions in HBase; an RDBMS is transactional.

• HBase has de-normalized data; an RDBMS has normalized data.

• HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.
MapR

MapR was a data platform that provided a distribution of Apache Hadoop along with
additional tools and capabilities for big data processing and analytics. It aimed to
simplify the management, storage, and processing of large volumes of data across
various environments, including on-premises and cloud.

• Key features of MapR included:

1. Distributed File System: MapR offered a high-performance file system that


allowed for the storage and retrieval of data across a cluster of machines.

2. Data Management: It provided tools for managing data workflows, including


support for real-time analytics and batch processing.
MapR (Contd…)

3. Ecosystem Integration: MapR was designed to integrate with various big


data tools and frameworks, such as Apache Spark, Apache Hive, and Apache
Drill.

4. NoSQL Database: MapR included a NoSQL database, enabling users to store


and query semi-structured and unstructured data efficiently.

5. Data Security and Governance: The platform featured robust security


controls and data governance capabilities to protect sensitive data.
Sharding

• Sharding is a very important concept that helps a system keep data in different resources according to the sharding process. The word "shard" means "a small part of a whole".

• Sharding means dividing a larger part into smaller parts. In a DBMS, sharding is a type of database partitioning in which a large database is divided, or partitioned, into smaller parts spread across different nodes. These shards are not only smaller but also faster and hence more easily manageable.
Sharding (Contd…)

• Sharding is a method for distributing a single dataset across multiple databases,


which can then be stored on multiple machines. This allows for larger datasets to be split
into smaller chunks and stored in multiple data nodes, increasing the total storage
capacity of the system.

• Sharding is a form of scaling known as horizontal scaling or scale-out, as additional


nodes are brought on to share the load.

• Horizontal scaling allows for near-limitless scalability to handle big data and intense
workloads.

• In contrast, vertical scaling refers to increasing the power of a single machine or single
server through a more powerful CPU, increased RAM, or increased storage capacity.
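A minimal sketch of hash-based shard routing, with a made-up shard count and user IDs; production systems such as MongoDB perform this routing automatically and with stronger hash functions.

# Route each record to one of several shards by hashing its shard key.
num_shards <- 4                                      # illustrative shard count

shard_for <- function(user_id, n = num_shards) {
  key_hash <- sum(utf8ToInt(as.character(user_id)))  # toy hash of the key
  (key_hash %% n) + 1                                # shard number 1..n
}

user_ids <- c(101, 102, 103, 204, 305)               # made-up user IDs
data.frame(user_id = user_ids, shard = sapply(user_ids, shard_for))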
Sharding (Contd…)
Why Sharding ?

• In replication, all writes go to the master node.

• Latency-sensitive queries still go to the master.

• A single replica set has a limitation of 12 nodes.

• Memory can't be large enough when the active dataset is big.

• The local disk is not big enough.

• Vertical scaling is too expensive.


Advantages of Sharding
Sharding allows you to scale your database to handle increased load to a nearly
unlimited degree by providing increased read/write throughput, storage capacity,
and high availability.
• Increased read/write throughput: By distributing the dataset across multiple
shards, both read and write operation capacity is increased as long as read and
write operations are confined to a single shard.
• Increased storage capacity: By increasing the number of shards, you can also
increase overall total storage capacity, allowing near-infinite scalability.
• High availability: Shards provide high availability in two ways. First, since each shard is a replica set, every piece of data is replicated. Second, even if an entire shard becomes unavailable, the database remains partially functional because the data is distributed across shards, with the remaining shards still serving their portion of the data.
Disadvantages of Sharding
Sharding does come with several drawbacks, namely overhead in query result compilation,
complexity of administration, and increased infrastructure costs.
• Query overhead: Each sharded database must have a separate machine or service which understands how to route a query to the appropriate shard. This introduces additional latency on every operation. Furthermore, if the data required for the query is horizontally partitioned across multiple shards, the router must query each shard and merge the results together.
• Complexity of administration: With a single unsharded database, only the database server itself requires upkeep and maintenance. Overall, a sharded database is a more complex system which requires more administration.
• Increased infrastructure costs: Sharding by its nature requires additional machines and compute power beyond a single database server. The cost of a distributed database system, especially if it is missing proper optimisation, can be significant.
NOSQL DATABASE

• NoSQL databases (aka "not only SQL") are non-tabular databases that store data differently from relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads.

• When people say "NoSQL database," they typically use it to refer to any non-relational database. Some say the term "NoSQL" stands for "non-SQL," while others say it stands for "not only SQL." Either way, most agree that NoSQL databases are databases that store data in a format other than relational tables.
History of NoSQL Databases

• NoSQL databases emerged in the late 2000s as the cost of storage dramatically
decreased. Gone were the days of needing to create a complex, difficult-to-
manage data model to avoid data duplication. Developers (rather than storage)
were becoming the primary cost of software development, so NoSQL databases
optimised for developer productivity.
NoSQL Database Features

Each NoSQL database has its own unique features. At a high level, many
NoSQL databases have the following features:

• Flexible schemas

• Fast queries due to the data model

• Horizontal scaling

• Ease of use for developers


Types of NoSQL Databases
Over time, four major types of NoSQL databases emerged: document databases, key-value databases, wide-column stores, and graph databases.
• Document databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types, including strings, numbers, Booleans, arrays, or objects.
• Key-value databases are a simpler type of database where each item contains keys and values.
• Wide-column stores store data in tables, rows, and dynamic columns.
• Graph databases store data in nodes and edges. Nodes typically store information about people, places, and things, while edges store information about the relationships between the nodes.
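A document-model record can be sketched as a nested R list and rendered as JSON, assuming the jsonlite package is installed; the field names and values below are made up.

library(jsonlite)                        # assumes jsonlite is installed

customer_doc <- list(
  `_id`     = "cust-001",
  name      = "Asha",
  is_active = TRUE,                                 # Boolean field
  orders    = c("ord-10", "ord-11"),                # array field
  address   = list(city = "Pune", pin = "411001")   # nested object
)

# Document databases typically store and exchange such records as JSON.
toJSON(customer_doc, auto_unbox = TRUE, pretty = TRUE)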
Difference between RDBMS and NoSQL Databases

While a variety of differences exist between relational database management


systems (RDBMS) and NoSQL databases, one of the key differences is the way
the data is modeled in the database.
When should NoSQL be used?

When deciding which database to use, decision-makers typically find one or


more of the following factors lead them to selecting a NoSQL database:

• Fast-paced Agile development

• Storage of structured and semi-structured data

• Huge volumes of data

• Requirements for scale-out architecture

• Modern application paradigms like microservices and real-time streaming


Advantages of NoSQL
There are many advantages of working with NoSQL databases such as
MongoDB and Cassandra. The main advantages are high scalability and high
availability.
1. High scalability: NoSQL databases use Sharding for horizontal scaling.
Vertical scaling means adding more resources to the existing machine
whereas horizontal scaling means adding more machines to handle the data.
Vertical scaling is not that easy to implement but horizontal scaling is easy to
implement.
2. High availability: Auto replication feature in NoSQL databases makes it
highly available because in case of any failure data replicates itself to the
previous consistent state.
Disadvantages of NoSQL

NoSQL has the following disadvantages.

1. Narrow focus: NoSQL databases have a very narrow focus, as they are mainly designed for storage and provide very little functionality beyond it. Relational databases are a better choice in the field of transaction management than NoSQL.

2. Open-source: NoSQL is open-source, and there is no reliable standard for NoSQL yet. In other words, two database systems are likely to be unequal.

3. Management challenge: The purpose of big data tools is to make the


management of a large amount of data as simple as possible. But it is not so easy.
Data management in NoSQL is much more complex than in a relational database.
Disadvantages of NoSQL (Contd…)

4. GUI is not available: GUI mode tools to access the database are not flexibly
available in the market.

5. Backup: Backup is a great weak point for some NoSQL databases like
MongoDB. MongoDB has no approach for the backup of data in a consistent
manner.

6. Large document size: Some database systems like MongoDB and CouchDB
store data in JSON format. This means that documents are quite large (Big Data,
network bandwidth, speed), and having descriptive key names actually hurts since
they increase the document size.
S3 (Simple Storage Service)

• S3 is a safe place to store files. It is object-based storage, i.e., you can store images, Word files, PDF files, etc. The files stored in S3 can be from 0 bytes to 5 TB in size. It has unlimited storage, meaning you can store as much data as you want. Files are stored in buckets; a bucket is like a folder in S3 that stores the files.

• S3 is a universal namespace, i.e., bucket names must be unique globally. A bucket gets a DNS address, so the bucket must have a unique name to generate a unique DNS address.
Advantages of Amazon S3
Advantages of Amazon S3 (Contd…)

1. Create Buckets: Firstly, we create a bucket and provide a name for it. Buckets are the containers in S3 that store the data. Buckets must have a unique name to generate a unique DNS address.

2. Storing data in buckets: A bucket can be used to store an infinite amount of data. You can upload as many files as you want into an Amazon S3 bucket, i.e., there is no maximum limit on the number of files. Each object can contain up to 5 TB of data. Each object can be stored and retrieved by using a unique developer-assigned key.
Advantages of Amazon S3 (Contd…)

3. Download data: You can also download your data from a bucket and can also give
permission to others to download the same data. You can download the data at any time
whenever you want.

4. Permissions: You can also grant or deny access to others who want to download or upload
the data from your Amazon S3 bucket. Authentication mechanism keeps the data secure from
unauthorized access.

5. Standard interfaces: S3 provides standard REST and SOAP interfaces, which are designed so that they can work with any development toolkit.

6. Security: Amazon S3 offers security features by protecting unauthorized users from


accessing your data.
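A sketch of these bucket operations from R, assuming the cloudyr aws.s3 package is installed and AWS credentials are configured; the bucket name and file names are placeholders, not real resources.

library(aws.s3)                            # assumes the aws.s3 package

bucket_name <- "my-example-bucket-12345"   # bucket names must be globally unique

# Upload (store) an object into the bucket.
put_object(file = "report.csv", object = "reports/report.csv",
           bucket = bucket_name)

# List the objects currently stored in the bucket.
get_bucket(bucket = bucket_name)

# Download the object back to the local file system.
save_object(object = "reports/report.csv", bucket = bucket_name,
            file = "report_copy.csv")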
Amazon S3 Concepts
Amazon S3 Concepts (Contd…)
1. Buckets
• A bucket is a container used for storing the objects.
• Every object is incorporated in a bucket.
• For example, if the object named photos/tree.jpg is stored in the treeimage bucket, then it can be addressed by using the URL http://treeimage.s3.amazonaws.com/photos/tree.jpg.
• A bucket has no limit on the number of objects it can store. No bucket can exist inside another bucket.
• S3 performance remains the same regardless of how many buckets have been created.
• The AWS user that creates a bucket owns it, and no other AWS user can own it. Therefore, we can say that the ownership of a bucket is not transferable.
• The AWS account that creates a bucket can delete it, but no other AWS user can delete the bucket.
Amazon S3 Concepts (Contd…)

2. Objects

➤ Objects are the entities which are stored in an S3 bucket.

➤ An object consists of object data and metadata, where the metadata is a set of name-value pairs that describe the data.

➤ An object has some default metadata, such as the date last modified, and standard HTTP metadata, such as Content-Type. Custom metadata can also be specified at the time of storing an object.

➤ It is uniquely identified within a bucket by key and version ID.


Amazon S3 Concepts (Contd…)

3. Key

➤ A key is a unique identifier for an object.

➤ Every object in a bucket is associated with one key.

➤ An object can be uniquely identified by using a combination of bucket name,


the key, and optionally version ID.
Amazon S3 Concepts (Contd…)

4. Regions

You can choose a geographical region in which you want to store the buckets
that you have created.

A region is chosen in such a way that it optimises latency, minimises costs, or addresses regulatory requirements.

Objects will not leave the region unless you explicitly transfer them to another region.
Hadoop Distributed File System (HDFS)

• HDFS (Hadoop Distributed File System) is a unique design that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware.

• Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability to parallel applications. It is cost-effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and the name node.
Where to use HDFS


1. Very Large Files: Files should be hundreds of megabytes, gigabytes or larger.

2. Streaming Data Access: The time to read the whole dataset is more important than the latency of reading the first record. HDFS is built on a write-once, read-many-times pattern.

3. Commodity Hardware: It works on low-cost hardware.


Where not to use HDFS

1. Low-Latency Data Access: Applications that require very low latency to access the first record should not use HDFS, as it gives importance to the whole dataset rather than the time to fetch the first record.

2. Lots of Small Files: The name node holds the metadata of files in memory, and if there are many small files, the metadata takes up a lot of the name node's memory, which is not feasible.

3. Multiple Writes: It should not be used when files must be written to multiple times.
HDFS Concepts

1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block size; e.g., a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space. The HDFS block size is large simply to minimise the cost of seeks.
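As a quick arithmetic illustration of the block rule above (the file size is made up):

block_size_mb <- 128
file_size_mb  <- 500

num_blocks    <- ceiling(file_size_mb / block_size_mb)            # 4 blocks
last_block_mb <- file_size_mb - (num_blocks - 1) * block_size_mb  # 116 MB

# A 500 MB file occupies 3 full 128 MB blocks plus one block holding 116 MB;
# that final block consumes only 116 MB of disk, not the full 128 MB.
c(blocks = num_blocks, last_block_mb = last_block_mb)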
HDFS Concepts (Contd…)

2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata information includes file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. File system operations like opening, closing, renaming etc. are executed by it.
HDFS Concepts (Contd…)

3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the name node.
HDFS Concepts (Contd…)
Starting HDFS

• HDFS should be formatted initially and then started in distributed mode. The commands are given below.

To format: $ hadoop namenode -format

To start: $ start-dfs.sh
HDFS Basic File Operations
1. Putting data into HDFS from the local file system

• First create a folder in HDFS where data can be put from the local file system.

$ hadoop fs -mkdir /user/test

• Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test:

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test

• Display the contents of the HDFS folder:

$ hadoop fs -ls /user/test

2. Copying data from HDFS to the local file system:

$ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt

3. Compare the files and see that both are the same:

$ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt

Recursive deleting:

$ hadoop fs -rmr <arg>

Example: $ hadoop fs -rmr /user/sonoo/


Features of HDFS

• Highly Scalable: HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.

• Replication: Due to some unfavourable conditions, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.

• Fault tolerance: In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active.
Features of HDFS (Contd…)

• Distributed data storage: This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.

• Portable: HDFS is designed in such a way that it can easily be ported from one platform to another.
Goals of HDFS

• Handling hardware failure: HDFS contains multiple server machines. If any machine fails, the HDFS goal is to recover from it quickly.

• Streaming data access: HDFS applications usually run on general-purpose file systems, but they require streaming access to their datasets.

• Coherence Model: Applications that run on HDFS are required to follow the write-once-read-many approach. So, a file once created need not be changed; however, it can be appended to or truncated.
Visual Data Analysis

• Data visualisation is a graphical representation of quantitative information and data using visual elements like graphs, charts, and maps.

• Data visualisation converts large and small datasets into visuals, which are easier for humans to understand and process.

• Data visualisation tools provide accessible ways to understand outliers, patterns, and trends in the data. In the world of Big Data, data visualisation tools and technologies are required to analyse vast amounts of information.
Visual Data Analysis (Contd…)

• Data visualisation techniques use charts and graphs to visualise large amounts of complex data. Visualisation provides a quick and easy way to convey concepts, and to summarise and present large data in easy-to-understand and straightforward displays, which gives readers insightful information.
• Data visualisation is one of the steps of the data science and data analytics process, which states that after data has been collected, processed and modeled, it must be visualised for conclusions to be made.

• Data visualisation is also an element of the broader data presentation architecture (DPA) discipline, which aims to identify, locate, manipulate, format and deliver data in the most efficient way possible.
Features of Data Visualisation

• Identify areas that need attention or improvement.

• Clarify which factors influence customer behaviour.

• Decision-making Ability.

• Integration Capability.

• Predict sales volumes.


Understanding the Motive of Visualisation

• Know your data.

• Getting to know the structure of your data.

• Which Variables are we trying to plot?

• How the x-axis and y-axis will be used for the representation.

• What different colours symbolise in the visualisation.


Identify the Purpose of the Visualisation

1. Identifying the purpose of creating a chart is necessary as this helps define the
structure of the process.

2. Select the right chart type.

3. Selecting the right type of chart is very crucial as this defines the overall
functionality of the chart.

4. Attention to Detail using colours, shapes, and sizes.

5. Choosing the correct type of colour, shape, and size is essential for
representing the chart.
Identify the Purpose of the Visualisation (Contd…)
Challenges of Data Visualisation
• Big Data involves large-volume, complex datasets, so such data cannot be visualised with traditional methods, as traditional data visualisation methods have many limitations.
1. Perceptual Scalability: Human eyes cannot extract all relevant information from a large volume of data. Even the desktop screen has its limitations if the dataset is large; too many visualisations cannot always fit on a single screen.
2. Real-time Scalability: It is always expected that all information should be real-time information, but this is hardly possible, as processing the dataset takes time.
3. Interactive Scalability: Interactive data visualisation helps to understand what is inside the datasets, but as big data volume increases exponentially, visualising the datasets takes a long time. The challenge is that sometimes the system may freeze or crash while trying to visualise the datasets.
Data Visualisation Techniques

1. Line Charts: Line Charts involve creating a graph where data is represented
as a line or as a set of data points joined by a line.
Data Visualisation Techniques (Contd…)
2. Area Chart: Area chart structure is a filled-in area that requires at least two
groups of data along an axis.
Data Visualisation Techniques (Contd…)

3. Pie Charts: Pie charts represent a graph in the shape of a circle. The whole
chart is divided into subparts, which look like a sliced pie.
Data Visualisation Techniques (Contd…)

4. Donut Chart: Doughnut charts are pie charts that do not contain any data
inside the circle.
Data Visualisation Techniques (Contd…)

5. Drill Down Pie Charts: Drill down pie charts are used for representing
detailed description for a particular category.
Data Visualisation Techniques (Contd…)

6. Bar Charts: A bar chart is the type of chart in which data is represented in
vertical series and used to compare trends over time.
Data Visualisation Techniques (Contd…)

7. Scatter and Bubble Charts: These create a chart in which the position and size of bubbles represent the data. They are used to show similarities among types of values, mainly when you have multiple data objects and you need to see the general relations.
Data Visualisation Techniques (Contd…)

8. 3D Charts: A 3D chart can be rotated and viewed from different angles, which helps in representing the data.
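Several of the chart types above can be drawn with single base-R function calls; the quarterly sales and profit figures below are invented sample data.

sales  <- c(Q1 = 20, Q2 = 35, Q3 = 30, Q4 = 45)
profit <- c(2, 5, 4, 8)

plot(sales, type = "l", main = "Line chart: sales trend",
     xlab = "Quarter", ylab = "Sales")                 # line chart
barplot(sales, main = "Bar chart: sales by quarter")   # bar chart
pie(sales, main = "Pie chart: share of annual sales")  # pie chart

# Bubble chart: point position and size together encode the data
# (here, bubble size represents profit).
symbols(1:4, sales, circles = sqrt(profit), inches = 0.25,
        main = "Bubble chart", xlab = "Quarter", ylab = "Sales")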
Data Visualisation Process Flow and Stages

Each data has its need to illustrate data. Below are the stages and process flow for Data
Visualisation.

1. Acquire: Obtaining the correct data type is a crucial part as the data can be collected
from various sources and can be unstructured.

2. Parse: Provide some structure for the data's meaning by restructuring the received
data into different categories, which helps better visualise and understand data.

3. Filter: Filtering out the data that cannot serve the purpose is essential as filtering out
will remove the unnecessary data, further enhancing the chart visualisation.
Data Visualisation Process Flow and Stages (Contd…)
4. Mine: Apply methods from statistics or data mining to discern patterns or place the data in a mathematical context. Data visualisation helps viewers gain insights that cannot be obtained from raw data or statistics alone.
5. Represent: One of the most significant challenges for users is deciding which chart suits best and represents the right information. Data exploration capability is valuable to statisticians as it reduces the need for repeated sampling to determine which data is relevant for each model.
6. Refine: Refining and improving the essential representation helps in user engagement.
7. Interact: Add methods for handling the data or controlling which features are visible.
Big Data Visualisation Tools
Nowadays, there are many data visualisation tools. Some of them are:
1. Google Charts: Google Charts is one of the easiest tools for visualisation. With the help of Google Charts, you can analyse anything from small datasets to complex unstructured datasets. We can implement simple charts as well as complex tree diagrams. Google Charts is available cross-platform as well.
2. Tableau: Tableau Desktop is a very easy-to-use big data visualisation tool. Two more versions of Tableau are available: one is 'Tableau Server' and the other is the cloud-based 'Tableau Online'. Here we can perform visualisation operations by applying drag-and-drop methods for creating visual diagrams. In Tableau, we can create dashboards very efficiently.
3. Microsoft Power BI: This tool is mainly used for business analysis. Microsoft Power BI can be run from desktops, smartphones, and even tablets. This tool also provides analysis results very quickly.
Big Data Visualisation Tools (Contd…)

4. D3: D3 is one of the best data visualisation tools. D3.js is an open-source visualisation tool.

5. Datawrapper: Datawrapper is a simple tool; even non-technical people can use it. Data represented in a table format or as responsive graphs like a bar chart, line chart, or map can be drawn quickly in Datawrapper.

6. Databox: Databox is another visualisation tool. It is an open-source tool. The whole dataset can be stored in one location in Databox; you can then discover insights in the data and perform visualisation operations. In the dashboard, you can view or compare data from different datasets.
Use Cases of Big Data Visualisation Tools

1. Sports Analysis: Based on previous datasets with the help of visualisation


tools, a winning percentage prediction is possible. Graph plotting for both
teams or players is possible, and analysis can be performed.

2. Fraud Detection: Fraud detection is a famous use case of big data. With
the help of visualisation tools after analysing data, a message can be
generated to others, and they will be careful about such fraud incidents.
Use Cases of Big Data Visualisation Tools (Contd…)

3. Price Optimisation: In any business, setting the price of a product is a significant issue. With visualisation tools, all the components involved can be analysed, the price can be compared with market prices, and then a suitable price can be set.

4. Security Intelligence: Visualising criminal records can help predict how much of a threat individuals pose to society. Each country has its security intelligence, and its task is to visualise information and inform others about a security threat.
Interactive Data Visualisation

• Interactive data visualisation refers to the use of software that enables direct
actions to modify elements on a graphical plot.

• Interactive data visualisation refers to the use of modern data analysis software that
enables users to directly manipulate and explore graphical representations of data.

• Data visualisation uses visual aids to help analysts efficiently and effectively
understand the significance of data. Interactive data visualisation software improves
upon this concept by incorporating interaction tools that facilitate the modification
of the parameters of a data visualisation, enabling the user to see more detail, create
new insights, generate compelling questions, and capture the full value of the data.
Interactive Data Visualisation Techniques

Deciding what the best interactive data visualisation will be for your project depends on
your end goal and the data available. Some common data visualisation interactions that
will help users explore their data visualizations include:

1. Brushing: Brushing is an interaction in which the mouse controls a paintbrush that


directly changes the colour of a plot, either by drawing an outline around points or by
using the brush itself as a pointer. Brushing scatterplots can either be persistent, in which
the new appearance is retained once the brush has been removed, or transient, in which
changes only remain visible while the active plot is enclosed or intersected by the brush.
Brushing is typically used when multiple plots are visible, and a linking mechanism exists
between the plots.
Interactive Data Visualisation Techniques (Contd…)
2. Painting: Painting refers to the use of persistent brushing, followed by
subsequent operations such as touring to compare the groups.
3. Identification: Identification, also known as label brushing or mouse over, refers
to the automatic appearance of an identifying label when the cursor hovers over a
particular plot element.
4. Scaling: Scaling can be used to change a plot's aspect ratio, revealing different
data features. Scaling is also commonly used to zoom in on dense regions of a
scatter plot.
5. Linking: Linking connects selected elements on different plots. One-to-one linking entails the projection of data on two different plots, in which a point in one plot corresponds to exactly one point in the other. Elements may also be categorical variables, in which case all data values corresponding to that category are highlighted in all the visible plots. Brushing an area in one plot will brush all the corresponding cases in the other linked plots.
How to Create Interactive Data Visualizations
• Creating various interactive widgets, bar charts, and plots for data visualisation
should start with the three basic attributes of a successful data visualisation
interaction design - available, accessible, and actionable. The general
framework for an interactive data structure visualisation project typically
follows these steps: identify your desired goals, understand the challenges
presented by data constraints, and design a conceptual model in which data can
be quickly iterated and reviewed.
• With a rough, conceptual model in place, data modeling is leveraged to
thoroughly document every piece of data and related meta-data. This is
followed by the design of a user interface and the development of your design's
core technology, which can be accomplished with a variety of interactive data
visualisation tools.
How to Create Interactive Data Visualizations (Contd…)

• Next, it's time to user test in order to refine compatibility, functionality, security, the
user interface, and performance. Now you are ready to launch to your target
audience. Methods for rapid updates should be built in so that your team can stay up
to date with your interactive data visualisation.

• Some popular libraries for creating your own interactive data visualizations include Altair, Bokeh, Celluloid, Matplotlib, interact, Plotly, Pygal, and Seaborn. Libraries are available for Python, Jupyter, JavaScript, and R interactive data visualizations. Scott Murray's Interactive Data Visualisation for the Web is one of the most popular educational resources for learning how to create interactive data visualizations.
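As a minimal sketch of such interaction, assuming the R plotly package is installed, the snippet below builds an interactive scatter plot of the built-in iris data in which hovering over a point shows an identifying label (the "identification" interaction described earlier):

library(plotly)                      # assumes the plotly package

plot_ly(data = iris,
        x = ~Sepal.Length, y = ~Sepal.Width,
        color = ~Species,            # category linked to colour
        type = "scatter", mode = "markers",
        text = ~Species)             # label shown on mouse-over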
Benefits of Interactive Data Visualizations

Some major benefits of interactive data visualizations include:

1. Identify Trends Faster: Most human communication is visual, as the human brain processes graphics orders of magnitude faster than it processes text. Direct manipulation of analysed data via familiar metaphors and digestible imagery makes it easy to understand and act on valuable information.

2. Identify Relationships More Effectively: The ability to narrowly focus on


specific metrics enables users to identify otherwise overlooked cause-and-
effect relationships throughout definable timeframes. This is especially useful
in identifying how daily operations affect an organization's goals.
Benefits of Interactive Data Visualizations (Contd…)

3. Useful Data Storytelling: Humans best understand a data story when its
development over time is presented in a clear, linear fashion. A visual data story
in which users can zoom in and out, highlight relevant information, filter, and
change the parameters promotes better understanding of the data by presenting
multiple viewpoints of the data.

4. Simplify Complex Data: A large dataset with a complex data story may
present itself visually as a chaotic, intertwined hairball. Incorporating filtering
and zooming controls can help untangle and make these messes of data more
manageable and can help users glean better insights.
Introduction to R

• R is a programming language and software environment for statistical analysis, graphics representation and reporting.

• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

• R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.

• The language was named R after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka). R is not only trusted by academics; many large companies also use the R programming language, including Uber, Google, Airbnb, Facebook and so on.
R is used for

• Statistical inference

• Data analysis

• Machine learning algorithm


Features of R Programming
There are the following features of R programming:
1. It is a simple and effective programming language which has been well developed.
2. It is data analysis software.
3. It is a well-designed, easy, and effective language which has the concepts of user-defined functions, looping, conditionals, and various I/O facilities.
4. It has a consistent and integrated set of tools used for data analysis.
5. For different types of calculations on arrays, lists and vectors, R contains a suite of operators.
6. It provides an effective data handling and storage facility.
7. It is open-source, powerful, and highly extensible software.
8. It provides highly extensible graphical techniques.
9. It allows us to perform multiple calculations using vectors (see the short sketch below).
10. R is an interpreted language.
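A quick illustration of the vectorised calculations referred to in points 5 and 9, using made-up marks:

marks  <- c(56, 72, 89, 64, 91)     # toy data
scaled <- marks / 100               # the operator applies to every element at once
scaled
mean(marks); sd(marks)              # built-in summary functions on the vector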
Why use R Programming?
• There are several tools available in the market to perform data analysis, and learning new languages is time-consuming. The data scientist can use two excellent tools, i.e., R and Python. We may not have time to learn them both when we are getting started with data science. Learning statistical modeling and algorithms is more important than learning a programming language; a programming language is used to compute and communicate our discoveries.
• The important task in data science is the way we deal with the data: cleaning, feature engineering, feature selection, and import. That should be our primary focus. The data scientist's job is to understand the data, manipulate it, and expose the best approach. For machine learning, the best algorithms can be implemented with R. Keras and TensorFlow allow us to create high-end machine learning techniques. R has a package for XGBoost, one of the best algorithms for Kaggle competitions. R communicates with other languages and can call Python, Java, and C++. The big data world is also accessible to R: we can connect R with systems like Spark or Hadoop.
• In brief, R is a great tool to investigate and explore data. Elaborate analyses such as clustering, correlation, and data reduction are done with R.
Applications of R
There are several-applications available in real-time. Some of the popular
applications are as follows:
• Facebook
• Google
• Twitter
• HRDAG
• Sunlight Foundation
• Real Climate
• FDA
R Analytics
• R analytics is data analytics using the R programming language, an open-source language used for statistical computing and graphics. It can be used in analytics to identify patterns and build practical models.
• R can not only help analyse an organisation's data but can also be used in the creation and development of software applications that perform statistical analysis. With a graphical user interface for developing programs, R supports a variety of analytical modeling techniques such as classical statistical tests, clustering, time-series analysis, linear and nonlinear modeling, and more. The interface has four windows: the script window, console window, workspace and history window, and tabs of interest.
• R allows for publication-ready plots and graphics and for storing reusable analytics for future data.
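As a small illustration of the statistical analysis and publication-ready graphics described above, the sketch below uses R's built-in mtcars data set; the choice of variables is arbitrary.

summary(mtcars$mpg)                  # min, quartiles, mean, max
sd(mtcars$mpg)                       # standard deviation

fit <- lm(mpg ~ wt, data = mtcars)   # simple linear model: mileage vs weight
summary(fit)

hist(mtcars$mpg, main = "Distribution of mpg")   # quick exploratory plot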
Benefits of R Analytics

The following are some of the main benefits realised by companies employing R
in their analytics programs:

1. Democratising Analytics Across the Organisation: R can help democratize


analytics by enabling business users with interactive data visualisation and
reporting tools. R can be used for data science by non-data scientists so that
business users and citizen data scientists can make better business decisions. R
analytics can also reduce time spent on data preparation and data wrangling,
allowing data scientists to focus on more complex data science initiatives.
Benefits of R Analytics (Contd…)

2. Providing Deeper, More Accurate Insights: Today, most successful companies are data
driven and therefore data analytics affects almost every area of business. And while there are
a whole host of powerful data analytics tools, R can help create powerful models to analyse
large amounts of data. Analytics and statistical engines using R provide deeper, more accurate
insights for the business. R can be used to develop very specific, in-depth analyses.

3. Leveraging Big Data: R can help with querying big data and is used by many industry
leaders to leverage Big Data across the business. With R analytics, organisations can surface
new insights in their large datasets and make sense of their data. R can handle these big
datasets and is arguably as easy if not easier for most analysts to use as any of the other
analytics tools available today.
Benefits of R Analytics (Contd…)

4. Creating Interactive Data Visualisations: R is also helpful for data


visualisation and data exploration because it supports the creation of graphs and
diagrams. It includes the ability to create interactive visualisations and 3D charts and graphs that are helpful for communicating with business users.
Graphic User Interfaces

• R is a command-line driven program. The user enters commands at the prompt (> by default) and each command is executed one at a time. There have been a number of attempts to create a more graphical interface, ranging from code editors that interact with R to full-blown GUIs that present the user with menus and dialog boxes.

• RStudio is an example of a code editor that interfaces with R on Windows, Mac OS, and Linux platforms.

• Perhaps the most stable, full-blown GUI is R Commander, which can also run under Windows, Linux, and Mac OS.
Graphic User Interfaces (Contd…)

• R is an open-source programming language and software environment for statistical


computing and graphics.

• It consists of a language together with a run-time environment with a debugger, graphics, access
to system functions, and scripting.

• R is an implementation of the S programming language, developed by Bell Laboratories, adding


lexical scoping semantics.

• R offers a wide variety of statistical and graphical techniques including time series analysis,
linear and nonlinear modeling, classical statistical tests, classification, clustering, and more.
Combined with a large collection of intermediate tools for data analysis, good data handling and
storage, general matrix calculation toolbox, R offers a coherent and well-developed system which is
highly extensible. Many statisticians and data scientists use R with the command line.
R Graphics
• Graphics play an important role in bringing out the important features of the
data. Graphics are used to examine marginal distributions, relationships
between variables, and summaries of very large data sets. They are a very
important complement to many statistical and computational techniques.
Standard Graphics

R standard graphics are available through package graphics; include several


functions which provide statistical plots, like:

• Scatterplots

• Pie charts

• Boxplots

• Barplots etc.

Each of the above graphs is typically produced with a single function call.
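
As a hedged illustration of these single-call plots, the following minimal sketch uses the built-in mtcars and iris datasets; the chosen variables and labels are illustrative, not requirements.

data(mtcars); data(iris)

plot(mtcars$wt, mtcars$mpg,                 # scatterplot: weight vs. mileage
     xlab = "Weight", ylab = "Miles/gallon")

pie(table(iris$Species), main = "Species share")   # pie chart of a factor

boxplot(mpg ~ cyl, data = mtcars,           # boxplots of mileage by cylinder count
        xlab = "Cylinders", ylab = "Miles/gallon")

barplot(table(mtcars$gear),                 # barplot of gear counts
        xlab = "Number of gears", ylab = "Frequency")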
Graphics Devices
• It is something where we can make a plot to appear. A graphics device is a
window on your computer (screen device), a PDF file (file device), a Scalable
Vector Graphics (SVG) file (file device), or a PNG or JPEG file (file device).

• There are some of the following points which are essential to understand:

1. The functions of graphics devices produce output, which depends on the active
graphics device.

2. A screen is the default and most frequently used device.

3. R graphical devices such as the PDF device, the JPEG device, etc. are used.
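
A minimal sketch of switching between devices, assuming the built-in mtcars dataset; the file names "mileage.pdf" and "mileage.png" are illustrative.

pdf("mileage.pdf", width = 6, height = 4)     # open a PDF file device
plot(mtcars$wt, mtcars$mpg)                   # output goes to the active device
dev.off()                                     # close the device to finish the file

png("mileage.png", width = 600, height = 400) # open a PNG file device
hist(mtcars$mpg)
dev.off()
# Calling plot() again now draws on the default screen device.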
The Basics of the Grammar of Graphics

• There are some key elements of a statistical graphic. These elements are the
basics of the grammar of graphics.
The Basics of the Grammar of Graphics (Contd…)
1. Data: Data is the most crucial thing which is processed and generates an output.
2. Aesthetic Mappings: Aesthetic mappings are one of the most important
elements of a statistical graphic. It controls the relation between graphics variables
and data variables. In a scatter plot, it also helps to map the temperature variable of
a dataset into the X variable. In graphics, it helps to map the species of a plant into
the colour of dots.
3. Geometric Objects: Geometric objects are used to express each observation by
a point using the aesthetic mappings. It maps two variables in the dataset into the x,
y variables of the plot.
4. Statistical Transformations: Statistical transformations allow us to calculate
the statistical analysis of the data in the plot. The statistical transformation uses the
data and approximates it with the help of a regression line having x, y coordinates,
and counts occurrences of certain values.
The Basics of the Grammar of Graphics (Contd…)

5. Scales: It is used to map the data values into values present in the coordinate
system of the graphics device.

6. Coordinate System: The coordinate system determines how the mapped values

are placed on the plot. Common choices are:

• Cartesian coordinates

• Polar coordinates

7. Faceting: Faceting is used to split the data into subgroups and draw sub-
graphs for each group.
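
The elements above map directly onto the ggplot2 package. The following hedged sketch (assuming ggplot2 is installed and using the built-in iris dataset) touches each element once; the particular variables and labels are illustrative.

library(ggplot2)

ggplot(data = iris,                                  # 1. data
       aes(x = Sepal.Length, y = Petal.Length,       # 2. aesthetic mappings
           colour = Species)) +
  geom_point() +                                     # 3. geometric objects
  geom_smooth(method = "lm", se = FALSE) +           # 4. statistical transformation (regression line)
  scale_x_continuous(name = "Sepal length (cm)") +   # 5. scales
  coord_cartesian() +                                # 6. Cartesian coordinate system
  facet_wrap(~ Species)                              # 7. faceting into subgroups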
Advantages of Data Visualisation in R

1. Understanding: Business information is more attractive and easier to

understand when presented through graphics and charts than in a written
document full of text and numbers. Thus, it can reach a wider range of
audiences and promotes the widespread use of business insights that lead to
better decisions.

2. Efficiency: Its applications allow us to display a lot of information in a small

space. Although the decision-making process in business is inherently
complex and multifunctional, displaying evaluation findings in a graph allows
companies to organise a lot of interrelated information in useful ways.
Advantages of Data Visualisation in R (Contd…)

3. Location: Applications that utilise features such as geographic maps and GIS
are particularly relevant to the wider business when location is a very relevant
factor. Maps can be used to show business insights from various locations, while
also conveying the seriousness of the issues, the reasons behind them, and the
working groups addressing them.
Disadvantages of Data Visualisation in R

1. Cost: Developing R visualisation applications costs a good amount of money. It may not be

possible, especially for small companies, to spend many resources on purchasing them. To
generate reports, many companies may employ professionals to create charts, which can further
increase costs. Small enterprises often operate in resource-limited settings, even though timely
evaluation results can be of high importance to them.

2. Distraction: At times, data visualisation apps create highly complex and fancy
graphics-rich reports and charts, which may entice users to focus more on the form than the
function. If visual appeal is put first, the overall value of the graphic representation
will be minimal. In resource-limited settings, it is necessary to understand how resources can
be best used and not to get caught up in the graphics trend without a clear purpose.
Graphical User Interfaces for R

• RStudio: Professional software for R with a code editor, debugging and


visualisation tools

• Rattle: R Analytic Tool to Learn Easily: Data Mining using R

• StatET for R: Eclipse based IDE (integrated development environment) for R

• RKWard: Easy to use and easily extensible IDE/GUI

• JGR: Universal and unified graphical user interface for R

• R Commander: A Basic-Statistics GUI for R

• Deducer: Intuitive, cross-platform graphical data analysis system


Data Import and Export

Importing Data in R

• Importing data in R programming means that we can read data from external files,
write data to external files, and can access those files from outside the R
environment. File formats like CSV, XML, xlsx, JSON, and web data can be imported
into the R environment to read the data and perform data analysis, and also the data
present in the R environment can be stored in external files in the same file formats.

• The easiest form of data to import into R is a simple text file, and this will often be
acceptable for problems of small or medium scale. The primary function to import from a
text file is scan(), and this underlies most of the more convenient functions discussed in
Spreadsheet-like data.
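
A minimal sketch of scan(), assuming a hypothetical plain text file values.txt containing whitespace-separated values in the working directory.

x <- scan("values.txt")             # read the values as a numeric vector
w <- scan("values.txt", what = "")  # read the same file as character strings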
Data Import and Export (Contd…)
• However, all statistical consultants are familiar with being presented by a client
with a memory stick (formerly, a floppy disc or CD-R) of data in some
proprietary binary format, for example 'an Excel spreadsheet' or 'an SPSS file'.
Often the simplest thing to do is to use the originating application to export the
data as a text file (and statistical consultants will have copies of the most
common applications on their computers for that purpose). However, this is not
always possible, and Importing from other statistical systems discusses what
facilities are available to access such files directly from R.
• For Excel spreadsheets, the available methods are summarised in Reading Excel
spreadsheets. In a few cases, data have been stored in a binary form for
compactness and speed of access. One application of this that we have seen
several times is imaging data, which is normally stored as a stream of bytes as
represented in memory, possibly preceded by a header. Such data formats are
discussed in Binary files and Binary connections.
Data Import and Export (Contd…)
Reading CSV Files

• CSV (Comma Separated Values) is a text file in which the values in columns are
separated by a comma. Before importing data into the R programming environment,
we set our working directory with the setwd() function.

• For example: setwd("C:/Users/intellipaat/Desktop/BLOG/files")

• To read a CSV file, we use the built-in function read.csv(), which returns the data
from the file as a data frame.

• For example:
read.data <- read.csv("file1.csv")
print(read.data)


Exporting Data in R

1. Exporting Data into Text/CSV: Exporting data into a text or a CSV file, is
the most popular and indeed common way of data export. Not only because
most of the software supports the option to export data into Text or CSV but
also because these files are supported by almost every software/programming
language that exists.

• There are two ways of exporting data into text files through R. One is using the
base R functions, and another one is using the functions from the readr package
to export data into text/CSV format.
Exporting Data in R (Contd…)
2. Using Built-in Functions: There is a popular built-in R function named write.table() which
can export data from the R workspace into text files. The function has two special cases,
write.csv() and write.csv2(), which export the data in CSV format; write.table() itself is the
adjustable form, where the default delimiters can be changed.

3. Using Functions from readr Package: The functions under the readr package are similar to
the functions available under base R, with a minor difference in their look (write.csv() from
base R and write_csv() from readr do the same job). Besides, the functions developed under the
readr package use a path = argument instead of file = to specify the path where the file needs
to be exported. The functions from the readr package exclude row names by default.
Exporting Data in R (Contd…)

4. Exporting Data into Excel: To export data into Excel from the R workspace, the best
option is the writexl package. This package allows you to export data as an Excel file in
xlsx format. It may look outdated at this moment, but the functions do their task with
precision, and newer versions of Excel remain able to open files written for older
versions.

5. Exporting Data into R Objects: There might be situations where you want to share
data from R as objects with colleagues working on different systems so that they can use
it right away in their own R workspace. These objects are of two types: .rda/.RData
and .rds.
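
A minimal sketch of both object formats; the object and file names are illustrative.

df <- data.frame(id = 1:3, score = c(80, 92, 75))

saveRDS(df, "df.rds")          # .rds stores a single object
df_copy <- readRDS("df.rds")   # it can be read back under any name

save(df, file = "df.RData")    # .RData/.rda stores one or more named objects
load("df.RData")               # restores the object(s) under their original names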
Exporting Data from Scripts in R Programming

So far, the operations using the R program have been done at a prompt/terminal, and their

results are not stored anywhere. In the software industry, however, most programs are
written to store the information fetched from the program. One such way is to
store the fetched information in a file. So, the two most common operations that
can be performed on a file are:

• Importing Data to R scripts

• Exporting Data from R scripts


Attributes and Data Types

• Attributes: An attribute is a data item that appears as a property of a data


entity. Machine learning literature tends to use the term feature while
statisticians prefer the term variable.

• Example: Let's consider an example like name, address, email, etc. are the
attributes for the contact information.

• Observed values for a given attribute are termed observations. The type of an
attribute is determined by its set of possible values: nominal, binary,
ordinal, or numeric.
Types of Attributes
1. Nominal Attributes: Nominal means "relating to names". The values of a nominal
attribute are symbols or names of things. Each value represents some kind of category,
code or state, and so nominal attributes are also referred to as categorical.
Example: Suppose that skin colour and education status are two attributes describing
person objects. In our implementation, possible values for skin colour are dark, white, and
brown. The attribute education status can contain the values undergraduate,
postgraduate, and matriculate. Both skin colour and education status are nominal attributes.
2. Binary Attributes: A binary attribute is a category of nominal attribute that
contains only two classes: 0 or 1, where 0 often means that the attribute is not present and
1 means that it is present. Binary attributes are referred to as Boolean if the two states
correspond to true and false.
Example: Given the attribute drinker describing a patient, 1 specifies that the patient
drinks, while 0 specifies that the patient does not. Similarly, suppose the patient undergoes
a medical test that has two possible outcomes.
Types of Attributes (Contd…)

3. Ordinal Attributes: An ordinal attribute is an attribute whose possible values

have a meaningful sequence or ranking among them, but the magnitude
between successive values is not known.

Example: Suppose that drink size corresponds to the sizes of drinks available at a

restaurant. This ordinal attribute has three possible values: small, medium, and
large.

The values have a meaningful sequence that corresponds to increasing drink

size; however, we cannot tell from the values how much bigger, say, a large
is than a medium.
Types of Attributes (Contd…)
4. Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity represented by
integer or real values. Numeric attributes can be of two types: interval-scaled and ratio-scaled.
(i) Interval-Scaled Attributes: Interval-scaled attributes are measured on a scale of equal-size units. The values
of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a
ranking of values, such attributes allow us to compare and quantify the difference between values.
Example: A temperature attribute is interval-scaled. We have different temperature values for every new day,
where each day is an entity. By sequencing the values, we obtain an ordering of entities with respect to
temperature. In addition, we can quantify the difference between values; for example, a temperature
of 20 degrees C is five degrees higher than a temperature of 15 degrees C.
(ii) Ratio-Scaled Attributes: A ratio-scaled attribute is a category of numeric attribute with an inherent zero
point. In addition, the values are ordered, and we can compute the difference between values, as well as the
mean, median, and mode.
Example: The Kelvin (K) temperature scale has what is considered a true zero point. It is the point at
which the particles that constitute matter have zero kinetic energy.
Types of Attributes (Contd…)

5. Discrete Attribute: A discrete attribute has a finite or countably infinite

set of values, which may appear as integers. The attributes skin colour, drinker,
medical report, and drink size each have a finite number of values and so are
discrete.

6. Continuous Attribute: A continuous attribute has real numbers as attribute


values.

Example: Height, weight, and temperature have real values. Real values can only
be represented and measured using a finite number of digits. Continuous attributes
are typically represented as floating-point variables.
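
As a hedged sketch, the attribute types above can be represented in R roughly as follows; the vectors are invented examples.

skin_colour <- factor(c("dark", "white", "brown"))       # nominal attribute
drinker     <- c(TRUE, FALSE, TRUE)                      # binary (Boolean) attribute
drink_size  <- factor(c("small", "large", "medium"),
                      levels = c("small", "medium", "large"),
                      ordered = TRUE)                    # ordinal attribute
temperature <- c(15.2, 20.0, 18.7)                       # continuous numeric attribute
children    <- c(0L, 2L, 1L)                             # discrete (integer) attribute
str(list(skin_colour, drinker, drink_size, temperature, children))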
Examples of Data Types
Descriptive Statistics in R

• In descriptive analysis, we describe our data with the help of various
representative methods like charts, graphs, tables, Excel files, etc.

• In descriptive analysis, we describe our data in some manner and

present it in a meaningful way so that it can be easily understood.

• Most of the time it is performed on small datasets and this analysis helps us a
lot to predict some future trends based on the current findings. Some
measures that are used to describe a dataset are measures of central tendency
and measures of variability or dispersion.
Process of Descriptive Analysis

1. Measure of central tendency: It represents the whole set of data by a single


value. It gives us the location of central points. There are three main measures of central
tendency:
• Mean
• Mode
• Median

2. Measure of variability: Measure of variability is known as the spread of data or


how well the data is distributed. The most common variability measures are:
• Range
• Variance
• Standard deviation
Need of Descriptive Analysis

• Descriptive Analysis helps us to understand our data and is a very important part of
Machine Learning. This is due to Machine Learning being all about making predictions.
On the other hand, statistics is all about drawing conclusions from data, which is a
necessary initial step for Machine Learning. Let's do this descriptive analysis in R.

Descriptive Analysis in R

Descriptive analyses consist of describing simply the data using some summary statistics
and graphics. Here, we'll describe how to compute summary statistics using R software.

Import your data into R

Before doing any computation, first we need to prepare our data and save it in external .txt
or .csv files; it is best practice to save the file in the current working directory.
R Functions for Computing Descriptive Analysis
R Functions for Computing Descriptive Analysis (Contd…)

1. Mean: It is the sum of observations divided by the total number of

observations. It is also defined as the average, which is the sum divided by the count.

Mean (x̄) = (Σ xi) / n

where n is the number of observations.

2. Median: It is the middle value of the dataset. It splits the data into two
halves. If the number of elements in the dataset is odd, then the centre element
is the median, and if it is even, then the median is the average of the two
central elements.
R Functions for Computing Descriptive Analysis (Contd…)

3. Mode: It is the value that has the highest frequency in the given dataset. The
dataset may have no mode if the frequency of all data points is the same. Also, we
can have more than one mode if we encounter two or more data points having the
same frequency.

4. Range: The range describes the difference between the largest and smallest
data point in our dataset. The bigger the range, the more is the spread of data and
vice versa.

Range = Largest data value - smallest data value


R Functions for Computing Descriptive Analysis (Contd…)

5. Variance: It is defined as the average squared deviation from the mean. It is

calculated by finding the difference between every data point and the
mean (the average), squaring these differences, adding them all, and
then dividing by the number of data points present in our dataset.

6. Standard Deviation: It is defined as the square root of the variance. It is

calculated by finding the mean, subtracting it from each number and
squaring the result, adding all these squared values, dividing by the number of
terms, and then taking the square root.
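
A minimal sketch of these measures on an invented numeric vector; note that R has no built-in function for the statistical mode, so a small helper is defined here, and var()/sd() compute the sample versions (dividing by n - 1).

x <- c(12, 15, 15, 18, 20, 22, 22, 22, 25)

mean(x)                        # arithmetic mean
median(x)                      # middle value

stat_mode <- function(v) {     # helper: most frequent value
  ux <- unique(v)
  ux[which.max(tabulate(match(v, ux)))]
}
stat_mode(x)

range(x)                       # smallest and largest value
diff(range(x))                 # range as a single number
var(x)                         # sample variance
sd(x)                          # sample standard deviation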
Exploratory Data Analysis
Exploratory Data Analysis or EDA is a statistical approach or technique for analysing datasets to
summarise their important and main characteristics generally by using some visual aids. The EDA
approach can be used to gather knowledge about the following aspects of data:

• Main characteristics or features of the data.

• The variables and their relationships.

• Finding out the important variables that can be used in our problem.

EDA is an Iterative Approach that Includes

• Generating questions about our data

• Searching for the answers by using visualisation, transformation, and modeling of our data.

• Using the lessons that we learn to refine our set of questions or to generate a new set of questions.
Exploratory Data Analysis in R

In R Language, we are going to perform EDA under two broad classifications:

• Descriptive Statistics, which includes mean, median, mode, inter-quartile


range, and so on.

• Graphical Methods, which includes histogram, density estimation, box plots,


and so on.
Data Inspection for EDA in R
To ensure that we are dealing with the right information, we need a clear view of our
data at every stage of the transformation process. Data inspection is the act of viewing
data for verification and debugging purposes, before, during, or after a translation.

Descriptive Statistics in EDA

For Descriptive Statistics to perform EDA in R, we will divide all the functions into the
following categories:

• Measures of central tendency

• Measures of dispersion

• Correlation
Graphical Method in EDA

Since we have already checked our data for missing values, blatant errors, and
typos, we can now examine our data graphically to perform EDA. We will see the
graphical representation under the following categories:

• Distributions

• Scatter and Line plot
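
A hedged sketch of these graphical methods on the built-in mtcars dataset; the chosen variables are illustrative.

hist(mtcars$mpg, main = "Distribution of mileage")     # distribution
plot(density(mtcars$mpg), main = "Density estimate")   # density estimation
boxplot(mpg ~ cyl, data = mtcars)                      # box plots by group
plot(mtcars$wt, mtcars$mpg)                            # scatter plot
lines(lowess(mtcars$wt, mtcars$mpg))                   # smoothed line added on top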


Visualisation before Analysis
• Data visualisation is the practice of translating information into a visual context,
such as a map or graph, to make data easier for the human brain to understand
and pull insights from.
• The main goal of data visualisation is to make it easier to identify patterns, trends
and outliers in large datasets. The term is often used interchangeably with others,
including information graphics, information visualisation and statistical graphics.
• Data visualisation is one of the steps of the data science process, which states that
after data has been collected, processed and modeled, it must be visualised for
conclusions to be made.
• Data visualisation is also an element of the broader data presentation architecture
(DPA) discipline, which aims to identify, locate, manipulate, format and deliver
data in the most efficient way possible.
Visualisation before Analysis (Contd…)
• Data visualisation is important for almost every career. It can be used by
teachers to display student test results, by computer scientists exploring
advancements in artificial intelligence (AI) or by executives looking to share
information with stakeholders.
• It also plays an important role in big data projects. As businesses
accumulated massive collections of data during the early years of the big data
trend, they needed a way to quickly and easily get an overview of their data.
• Visualisation is central to advanced analytics for similar reasons. When a data
scientist is writing advanced predictive analytics or machine learning (ML)
algorithms, it becomes important to visualise the outputs to monitor results and
ensure that models are performing as intended. This is because visualisations of
complex algorithms are generally easier to interpret than numerical outputs.
Why is Data Visualisation Important?

• Data visualisation provides a quick and effective way to communicate


information in a universal manner using visual information.

• The practice can also help businesses identify which factors affect customer
behaviour; pinpoint areas that need to be improved or need more attention;
make data more memorable for stakeholders; understand when and where
to place specific products; and predict sales volumes.
Benefits of Data Visualisation
• The ability to absorb information quickly, improve insights and make faster
decisions;
• An increased understanding of the next steps that must be taken to improve the
organisation;
• An improved ability to maintain the audience's interest with information they can
understand;
• An easy distribution of information that increases the opportunity to share insights
with everyone involved;
• Eliminate the need for data scientists since data is more accessible and
understandable; and
• An increased ability to act on findings quickly and, therefore, achieve success with
greater speed and fewer mistakes.
R Data Visualisation

R Visualisation Packages R provides a series of packages for data visualisation. These


packages are as follows:

1. plotly: The plotly package provides online interactive, high-quality graphs. This package
extends upon the JavaScript library plotly.js.

2. ggplot2: R allows us to create graphics declaratively. R provides the ggplot2 package for
this purpose. This package is famous for its elegant, high-quality graphs, which set it apart
from other visualisation packages.

3. tidyquant: The tidyquant package is used for carrying out quantitative
financial analysis. This package falls under the tidyverse universe as a financial package that is
used for importing, analysing, and visualising data.
R Data Visualisation (Contd…)
4. taucharts: Data plays an important role in taucharts. The library provides a declarative
interface for rapid mapping of data fields to visual properties.

5. ggiraph: It is a tool that allows us to create dynamic ggplot graphs. This package allows
us to add tooltips, JavaScript actions, and animations to the graphics.

6. geofacet: This package provides geofaceting functionality for 'ggplot2'. Geofaceting

arranges a sequence of plots for different geographical entities into a grid that preserves
some of the geographical orientation.

7. googleVis: googleVis provides an interface between R and Google's charts tools. With
the help of this package, we can create web pages with interactive charts based on R data
frames.
R Data Visualisation (Contd…)

8. RColorBrewer: This package provides colour schemes for maps and other
graphics, which were designed by Cynthia Brewer.

9. dygraphs: The dygraphs package is an R interface to the dygraphs JavaScript


charting library. It provides rich features for charting time-series data in R.

10. shiny: R allows us to develop interactive and aesthetically pleasing web apps
by providing a shiny package. This package provides various extensions with
HTML widgets, CSS, and JavaScript.
R Data Visualisation (Contd…)
Data Visualisation in R
Data visualisation is the technique used to deliver insights in data using visual cues such as
graphs, charts, maps, and many others. This is useful as it helps in intuitive and easy
understanding of the large quantities of data and thereby make better decisions regarding it.

Data Visualisation in R Programming Language


• The popular data visualisation tools that are available are Tableau, Plotly, R, Google Charts,
Infogram, and Kibana. The various data visualisation platforms have different capabilities,
functionality, and use cases. They also require a different skill set.

• R is a language that is designed for statistical computing, graphical data analysis, and
scientific research. It is usually preferred for data visualisation as it offers flexibility and
requires minimal coding.
Types of Data Visualisations
1. Bar Plot : There are two types of bar plots-horizontal and vertical which
represent data points as horizontal or vertical bars of certain lengths
proportional to the value of the data item. They are generally used for
continuous and categorical variable plotting. By setting the horiz parameter to
TRUE or FALSE, we can get horizontal or vertical bar plots, respectively.
2. Histogram: It is like a bar chart as it uses bars of varying height to represent
data distribution. However, in a histogram values are grouped into consecutive
intervals called bins. In a Histogram, continuous values are grouped and
displayed in these bins whose size can be varied.
3. Box Plot: The statistical summary of the given data is presented graphically
using a boxplot. A boxplot depicts information like the minimum and
maximum data point, the median value, first and third quartile, and
interquartile range.
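
A minimal sketch of the three plot types described above, using the built-in mtcars dataset for illustration.

counts <- table(mtcars$gear)
barplot(counts)                 # vertical bar plot
barplot(counts, horiz = TRUE)   # horizontal bar plot via horiz = TRUE

hist(mtcars$mpg, breaks = 10)   # histogram; 'breaks' controls the bins

boxplot(mtcars$mpg)             # box plot of the five-number summary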
Analytics for Unstructured Data
• Unstructured data is data that doesn't have a fixed form or structure. Images,
videos, audio files, text files, social media data, geospatial data, data from
IoT devices, and surveillance data are examples of unstructured data. About
80%-90% of data is unstructured. Businesses process and analyse unstructured
data for different purposes, like improving operations and increasing revenue.
Unstructured data analysis is complex and requires specialised techniques,
unlike structured data, which is straightforward to store and analyse.
• Here is a quick glance at all the unstructured data analysis techniques and tips:
1. Keep the business objective(s) in mind
2. Define metadata for faster data access
3. Choose the right analytics techniques
4. Qualitative data analysis techniques
Analytics for Unstructured Data (Contd…)

5. Exploratory data analysis techniques


6. Artificial Intelligence (AI) and Machine Learning (ML) techniques
7. Identify the right data sources
8. Evaluate the technologies you'd want to use
9. Get real-time data access
10. Store and integrate data using data lake
Tips to Analyse Unstructured Data
1. Start with your end goal in mind: Start with a solid idea of what you want to
accomplish. Text analysis methods, like keyword extraction, sentiment analysis, and
topic classification, allow you to pull opinions and ideas from text, then organise and
analyse them more thoroughly for quantitative and qualitative results, so the
possibilities are vast.
2. Collect Unstructured Data: Once you've decided what you want to accomplish,
you need to find your data. Make sure to use data sources that are relevant to your
topic and the goals you set, like customer surveys and online reviews.
• Whatever technique you use, make sure no data is lost. Databases and data warehouses
can provide access to structured data. But "data lakes" - repositories that store data in
its raw format- offer better access to unstructured data and retain all useful
information.
• Tools like Monkey Learn allow you to connect directly to Twitter or pull data from
other social media sites, news articles, etc. As data moves fast in our current business
climate, you'll want to learn how to collect real-time data to stay on top of your brand
image.
Tips to Analyse Unstructured Data (Contd…)
3. Clean Unstructured Data: Unstructured text data often comes with repetitive
text or irrelevant text and symbols, like email signatures, URL links, emojis,
banner ads, etc. This information is unnecessary to your analysis and will only
skew the results, so it's important you learn how to clean your data.
You can start with some simple word processing tasks, like running spell check,
removing repetitious words, special characters, and URL links, or give a quick
read to make sure words are used correctly.
4. Structure your Unstructured Data: Text analysis machine learning programs
use natural language processing algorithms to break down unstructured text data.
Data preparation techniques like tokenisation, part-of-speech tagging, stemming,
and lemmatization effectively transform unstructured text into a format that can
be understood by machines. This is then compared to similarly prepared data in
search of patterns and deviations to make interpretations.
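
Steps 3 and 4 can be sketched in base R as follows; the review strings are invented, and this simple clean-up and tokenisation stands in for the NLP pipelines mentioned above.

reviews <- c("Great product!! http://example.com :)",
             "Terrible support... visit www.example.com NOW")

clean <- tolower(reviews)                          # normalise case
clean <- gsub("http\\S+|www\\.\\S+", "", clean)    # remove URL links
clean <- gsub("[^a-z ]", " ", clean)               # drop symbols, emojis, digits
clean <- gsub("\\s+", " ", trimws(clean))          # collapse repeated spaces

tokens <- strsplit(clean, " ")                     # simple tokenisation
tokens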
Tips to Analyse Unstructured Data (Contd…)
5. Analyse your Unstructured Data: Once the data is structured, you're ready
for analysis. Depending on your goals, you can calculate whatever metrics you
need. SaaS tools allow you to pick and choose from many different extraction
and classification techniques and use them in concert to get a view of the big
picture or super minute details.
Maybe you're following a new product launch or marketing campaign, and you
need to know how customers feel about it. You can extract data from social
media posts or online reviews relating only to the subject you need, perform
sentiment analysis on them, and follow the sentiment over time.
6. Visualise your Analysis Results: Creating charts and graphs to visualise your
data can make analyses much easier to comprehend and compare. Monkey Learn
Studio is an all-in-one business intelligence platform where you can perform all
the above in one single interface and then visualise your results in striking detail
for an interactive data experience.
Unstructured Data Analysis Techniques
• Exploratory Data Analysis: Exploratory Data Analysis (EDA) is a set of
initial investigations done to identify the main characteristics of data. It is done
using summary statistics and graphics. EDA includes multiple techniques,
including:
• Quantitative data analysis: Quantitative data analysis techniques give
discrete values and results. These techniques include mathematical and
statistical analysis like finding the mean, correlations, range, standard
deviation, labeling data (classification), regression analysis techniques, cluster
analysis, text analytics, keyword search, and hypothesis testing using random
sample data. The MongoDB aggregation framework provides rich capabilities
for quantitative analysis. You can also use unstructured data analysis tools like
R/Python for advanced unstructured data analytics for data stored in
MongoDB.
Unstructured Data Analysis Techniques (Contd…)
• Visualisation techniques: Exploratory data analysis often uses visual methods to
uncover relationships between the data variables. You can easily identify patterns
and eliminate outliers and anomalies. Some popular techniques are dimensionality
reduction, graphical techniques like multivariate charts, histogram, box plots, and
more. For example, flow maps can show how many people travel to and from New
York City per day. Pie charts are a great way to explore data distributions across
various categories, including which age groups of people like to read books or
watch TV and so on. MongoDB Charts presents a unified view of all your
MongoDB Atlas data and quickly provides rich visual insights.
Unstructured Data Analysis Techniques (Contd…)
• Qualitative Data Analysis: Qualitative data analysis mainly applies to
unstructured text data. This can include documents, surveys, interview
transcripts, social media content, medical records, and sometimes audio and
video clips as well. These techniques need reasoning, contextual
understanding, social intelligence, and intuition rather than a mathematical
formula (as in quantitative analysis). Content analysis, discourse analysis, and
narrative analysis are some types of qualitative analysis. There are two
approaches for qualitative data analysis:
1. Inductive - Data analysts have the data in hand and derive insights from the
data to develop a theory.
2. Deductive - Data analysts and researchers already have a theory and collect data
and facts to validate and prove that theory.
Unstructured Data Analysis Techniques (Contd…)

• AI and ML: AI and ML unstructured data analysis techniques include decision

trees, Principal Component Analysis (PCA), Natural Language Processing (NLP),
artificial neural networks, image analysis, temporal modeling techniques, market
segmentation analysis, and more. These techniques help with predictive analytics
and uncovering data insights. Suppose you ordered a shipment of 100 bicycles
and want to track its delivery status at different times: temporal modeling techniques
will do that for you. Similarly, to know how people are reacting to your new ad
campaign (positive or negative), use sentiment analysis (an NLP technique). MongoDB
is a great choice for training ML models because of its flexible data model.
Unstructured Data Analysis Challenges

Unstructured data analysis has a potential to generate huge business insights. However,
traditional storage and analysis techniques are not sufficient to handle unstructured data.
Here are some of the challenges that companies face in analysing unstructured data:

1. Big Data characteristics: volume, velocity, variety

2. Data reliability and consistency

3. Data protection

4. Complex nature of data management

5. Data migration

6. Cognitive bias
Unstructured Data Analytics Tools

• Unstructured data analytics tools use machine learning to gather and analyse data that has
no pre-defined framework-like human language. Natural language processing (NLP)
allows software to understand and analyse text for deep insights, much as a human would.

• Unstructured data analysis can help your business answer more than just the "What is
happening?" of numbers and statistics and go into qualitative results to understand "Why
is this happening?"

• Monkey Learn is a SaaS platform with powerful text analysis tools to pull real-world and
real-time insights from your unstructured information, whether it's public data from the
internet, communications between your company and your customers, or almost any other
source.
Unstructured Data Analytics Tools (Contd…)

• Among the most common and most useful tools for unstructured data analysis are:

1. Sentiment analysis to automatically classify text by sentiment (positive,


negative, neutral) and read for the opinion and emotion of the writer.

2. Keyword extraction to pull the most used and most important keywords from
text: find recurring themes and summarise whole pages of text.

3. Intent and email classification to understand the intent of a comment or query


and automatically review emails for level of interest.
Difference between Data Analytics and Data Visualization

Based on | Data Visualization | Data Analytics
Definition | Data visualization is the graphical representation of information and data in a pictorial or graphical format. | Data analytics is the process of analyzing data sets in order to make decisions about the information they contain, increasingly with specialized software and systems.
Benefits | Identify areas that need attention or improvement; clarify which factors influence customer behavior; help understand which products to place where; predict sales volumes. | Identify the underlying models and patterns; act as an input source for data visualization; help improve the business by predicting its needs; support drawing conclusions.
Used for | The goal of data visualization is to communicate information clearly and efficiently to users by presenting it visually. | Every business collects data; data analytics helps the business make more-informed business decisions by analyzing that data.
Relation | Data visualization helps to get a better perception of the data. | Together, data visualization and analytics draw conclusions about the datasets; in a few scenarios, analytics might act as a source for visualization.

Difference between Data Analytics and Data Visualization (Contd…)

Based on | Data Visualization | Data Analytics
Industries | Data visualization technologies and techniques are widely used in finance, banking, healthcare, retailing, etc. | Data analytics technologies and techniques are widely used in commercial, finance, healthcare, crime detection, travel agencies, etc.
Tools | Plotly, DataHero, Tableau, Dygraphs, QlikView, ZingChart, etc. | Trifacta, Excel/Spreadsheet, Hive, Polybase, Presto, Clear Analytics, SAP Business Intelligence, etc.
Platforms | Big data processing, service management dashboards, analysis and design. | Big data processing, data mining, analysis and design.
Techniques | Data visualization can be static or interactive. | Data analytics can be prescriptive analytics or predictive analytics.
Performed by | Data engineers | Data analysts
Comparison between CLIQUE and PROCLUS

Aspect | CLIQUE (CLustering In QUEst) | PROCLUS (Projected CLUStering)
Type | Density-based subspace clustering | Partition-based projected clustering
Clustering Approach | Identifies dense grid cells in subspaces to form clusters | Projects data into subspaces and assigns medoids to form clusters
Subspace Handling | Axis-aligned subspaces (fixed grid structure in each dimension) | Flexible projected subspaces tailored for each cluster
Cluster Shape | Grid-like and axis-aligned | Arbitrary, based on projections
Dimensionality | Operates directly on subspaces of all dimensions | Selects relevant dimensions for each cluster

Comparison between CLIQUE and PROCLUS (Contd…)

Aspect | CLIQUE (CLustering In QUEst) | PROCLUS (Projected CLUStering)
Input Parameters | Grid size (granularity); density threshold | Number of clusters; average subspace dimensionality
Algorithm Workflow | Divides the data space into equal-sized grid cells; identifies dense regions in each subspace; merges dense regions to form clusters | Selects a set of medoids; assigns dimensions to clusters; iteratively refines clusters and their subspaces
Strengths | Finds clusters in axis-aligned subspaces; handles noise effectively; minimal input parameters required | Identifies clusters in arbitrary projected subspaces; works well with datasets containing clusters in varying dimensions
Weaknesses | Limited to axis-aligned clusters; may miss clusters in non-aligned subspaces | Requires number of clusters and subspace dimensionality as input; sensitive to initialization
Applications | Market basket analysis; simple pattern detection in large datasets | Gene expression analysis; high-dimensional clustering
Difference Between Data Science and Data Analytics

Feature | Data Science | Data Analytics
Coding Language | Python is the most used language for data science, along with other languages such as C++, Java, Perl, etc. | Knowledge of the Python and R languages is essential for data analytics.
Programming Skills | In-depth knowledge of programming is required for data science. | Basic programming skills are sufficient for data analytics.
Use of Machine Learning | Data science makes use of machine learning algorithms to get insights. | Data analytics does not use machine learning to get insight into the data.
Other Skills | Data science makes use of data mining activities for getting meaningful insights. | Hadoop-based analysis is used for getting conclusions from raw data.

Difference Between Data Science and Data Analytics (Contd…)

Feature | Data Science | Data Analytics
Scope | The scope of data science is large. | The scope of data analysis is micro, i.e., small.
Goals | Data science deals with explorations and new innovations. | Data analysis makes use of existing resources.
Data Type | Data science mostly deals with unstructured data. | Data analytics deals with structured data.
Statistical Skills | Statistical skills are necessary in the field of data science. | Statistical skills are of minimal or no use in data analytics.
Differentiation between NoSQL and RDBMS

Aspect | NoSQL | RDBMS
Structure | Flexible schema; key-value, document, graph, or wide-column formats. | Rigid schema; data stored in tables with rows and columns.
Query Language | No standard language (e.g., JSON, CQL, APIs). | Uses SQL (Structured Query Language).
Scalability | Horizontally scalable (adding more servers). | Vertically scalable (upgrading hardware).
Performance | Optimized for large-scale, high-throughput operations. | Efficient for complex queries and transactions.
Data Consistency | Typically eventual consistency (CAP theorem trade-offs). | Strong consistency due to ACID compliance.
Use Cases | Suitable for unstructured data, real-time analytics, and big data. | Suitable for structured data and applications needing strict consistency.
Data Mining

Data mining refers to the process of discovering patterns, relationships, or insights in


large datasets using statistical, machine learning, and data analysis techniques. It
involves extracting useful information from raw data to help make informed decisions.
Key components of data mining include:

• Classification: Categorizing data into predefined groups.

• Clustering: Grouping similar data points together.

• Association Rules: Identifying relationships between variables (e.g., market basket


analysis).

• Prediction: Forecasting future trends based on historical data.
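
As a hedged illustration of one of these components, the sketch below runs k-means clustering on the built-in iris measurements; choosing k = 3 is an assumption made for the example.

set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3)   # group similar observations
table(km$cluster, iris$Species)          # compare clusters with the known labels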


Data Warehousing

A data warehouse is a centralized repository designed to store, manage, and query


large volumes of data from multiple sources. It is optimized for analytical purposes
rather than transactional workloads. Key features include:

• Data Integration: Aggregating data from various sources into a unified format.

• Historical Data Storage: Keeping long-term data for trend analysis.

• Query Optimization: Designed for efficient data retrieval using SQL and OLAP
(Online Analytical Processing).

• Decision Support: Facilitates business intelligence (BI) tasks like reporting,


dashboards, and data visualization.
Machine Learning

• Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on


developing algorithms and models that enable computers to learn from and
make predictions or decisions based on data, without being explicitly
programmed for specific tasks. It automates the process of pattern recognition
and adapts to new data over time.
Types of Machine Learning

1. Supervised Learning: The model is trained on labeled data (input-output

pairs). Examples:
• Regression: Predicting continuous values (e.g., house prices).
• Classification: Assigning categories (e.g., spam detection).
2. Unsupervised Learning: The model identifies patterns or groupings in
unlabeled data. Examples:
• Clustering: Grouping similar items (e.g., customer segmentation).
• Dimensionality Reduction: Simplifying data (e.g., PCA).
3. Reinforcement Learning: The model learns through trial and error by
interacting with an environment to maximize a reward.
• Examples: Training robots, game playing (e.g., AlphaGo).
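
A minimal sketch of the first two learning types in R, using built-in data; the chosen predictors and the value of centers are illustrative.

# Supervised learning: regression on labeled (input-output) data
model <- lm(mpg ~ wt + hp, data = mtcars)
predict(model, newdata = data.frame(wt = 3, hp = 110))   # predict an unseen case

# Unsupervised learning: clustering unlabeled observations
set.seed(1)
clusters <- kmeans(scale(mtcars[, c("mpg", "wt", "hp")]), centers = 2)
clusters$cluster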
Data Mining Task Primitives

Data mining task primitives are the basic building blocks or operations used in
the data mining process to define and execute tasks for extracting patterns,
insights, or knowledge from data. These primitives help specify what type of
data mining task is being performed and the parameters involved.
Purpose of Task Primitives
• These primitives serve as a framework to guide the data mining process,
ensuring that:
• Goals are clearly defined.
• Techniques are appropriately applied.
• Results are interpretable and actionable.
Cloud Computing
Cloud Computing refers to the delivery of computing services—including
servers, storage, databases, networking, software, analytics, and intelligence—
over the internet ("the cloud"). It allows users to access and use resources on
demand without needing to own or manage physical hardware or infrastructure.
Advantages of Cloud Computing

• Cost Efficiency: Reduces the need for upfront hardware investment.
• Flexibility: Access services anytime, anywhere.
• Scalability: Handle fluctuating workloads efficiently.
• Security: Providers offer robust security features and compliance options.
• Disaster Recovery: Ensures business continuity with backups and redundancy.
Applications of Cloud Computing

• Business: Customer Relationship Management (CRM), ERP systems.
• Healthcare: Storing patient records securely.
• Education: Online learning platforms.
• Entertainment: Streaming services like Netflix and Spotify.
• Technology Development: Hosting and testing software applications.
