DA Unit 5
Unit 5
Frameworks and Visualization
Syllabus
• Frameworks provide a set of tools and elements that help speed up the
development process. A framework acts like a template that can be used and even
modified to meet the project requirements.
• It involves transforming complex data sets into visuals like charts, graphs,
maps, or even interactive dashboards, allowing people to spot trends,
patterns, and insights quickly.
MapReduce
1. Map: The input data is divided into smaller, manageable chunks, which are
processed in parallel. In this step, each chunk is analyzed or transformed, and a
key-value pair is generated.
MapReduce (Contd…)
2. Reduce: The output from the map step is then grouped by key and processed to
combine the values. This step summarizes the data based on the specific problem
requirements. For instance, in the word count example, the reduce function would
aggregate counts for each unique word across all chunks, providing a total count.
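To make the two steps concrete, here is a minimal word-count sketch written in R (the language used later in this unit). It is an illustration of the idea only, not Hadoop itself; the chunks and function names are made up.
# Map step: for each chunk, emit (word, 1) key-value pairs
chunks <- list("big data is big", "map reduce splits big jobs")
map_fn <- function(chunk) {
  words <- unlist(strsplit(chunk, "\\s+"))
  setNames(rep(1, length(words)), words)   # names are the keys, values are 1
}
mapped <- unlist(lapply(chunks, map_fn))   # in Hadoop this runs in parallel
# Reduce step: group the pairs by key (word) and sum the values
word_counts <- tapply(mapped, names(mapped), sum)
print(word_counts)                         # e.g., big = 3, data = 1, is = 1, ...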
• MapReduce is highly scalable and works well with distributed storage systems
like Hadoop Distributed File System (HDFS). It’s widely used in big data
applications where data sets are too large to fit on a single machine, such as in
search engines, recommendation systems, and large-scale analytics.
MapReduce Architecture
Components of MapReduce Architecture
1. Client: The MapReduce client is the one who brings the job to
MapReduce for processing. There can be multiple clients that
continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work that the client wants to do,
which is composed of many smaller tasks that the client wants to process or
execute.
3. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job.
The results of all the job-parts are combined to produce the final output.
4. Input Data: The data set that is fed to MapReduce for processing.
• It is fault tolerant.
• It is highly available.
• It is low cost.
Advantages of Hadoop
• High flexibility.
• Cost effective.
• Linear scaling.
Disadvantages of Hadoop
• Security concerns.
Pig
• Pig represents big data as data flows. Pig is a high-level platform or tool which is used to
process large datasets.
• It provides a high level of abstraction over MapReduce. It provides a high-level scripting
language, known as Pig Latin, which is used to develop the data analysis code.
• First, to process the data stored in HDFS, the programmers write scripts using the Pig Latin
language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts
into specific map and reduce tasks. These are not visible to the programmers, in order to
provide a high level of abstraction.
• Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The result of
Pig is always stored in HDFS.
Need of Pig
• It uses a query-based approach, which reduces the length of the code.
• Easy to learn, read and write. Especially for SQL programmers, Apache Pig is a
boon.
• Pig can handle the analysis of both structured and unstructured data.
Applications of Pig
• For collecting large amounts of data in the form of search logs and web crawls.
• Used where analytical insights are needed through sampling.
Apache Pig Architecture
• The language used to analyse data in Hadoop using Pig is known as Pig Latin. It is a
high-level data processing language which provides a rich set of data types and
operators to perform various operations on the data.
• To perform a particular task using Pig, programmers need to write a Pig script in the
Pig Latin language and execute it using any of the execution mechanisms. After
execution, these scripts go through a series of transformations applied by the Pig
framework to produce the desired output.
• Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and
thus it makes the programmer's job easy. The architecture of Apache Pig is shown
below:
Apache Pig Architecture (Contd…)
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig framework as discussed
below:
• Parser: Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script,
does type checking, and other miscellaneous checks. The output of the parser will be a DAG
(directed acyclic graph), which represents the Pig Latin statements and logical operators. In
the DAG, the logical operators of the script are represented as the nodes and the data flows are
represented as edges.
• Optimiser: The logical plan (DAG) is passed to the logical optimiser, which carries out the
logical optimisations such as projection and pushdown.
• Compiler: The compiler compiles the optimised logical plan into a series of MapReduce jobs.
Apache Pig Components (Contd…)
• Pig Latin Data Model: The data model of Pig Latin is fully nested, and it
allows complex non-atomic datatypes such as map and tuple.
• Atom: Any single value in Pig Latin, irrespective of its data type, is known as
an Atom. It is stored as a string and can be used as a string and number. int, long,
float, double, chararray, and bytearray are the atomic values of Pig. A piece of
data or a simple atomic value is known as a field.
Apache Pig Components (Contd…)
• Tuple: A record that is formed by an ordered set of fields is known as a tuple; the fields can
be of any type. A tuple is like a row in a table of an RDBMS. Example: (Raja, 30)
• Bag: A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique)
is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is
represented by { }. It is like a table in an RDBMS but, unlike a table in an RDBMS, it is not
necessary that every tuple contain the same number of fields or that the fields in the same
position (column) have the same type. Example: {(Raja, 30), (Mohammad, 45)}
• Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray
and should be unique. The value can be of any type. It is represented by [ ]. Example:
[name#Raja, age#30]
Difference between Pig and MapReduce
• Apache Pig is a scripting language, whereas MapReduce is a compiled programming language.
• Apache Pig allows nested data types like map, tuple and bag, whereas MapReduce does not allow nested data types.
Difference between Pig and SQL
Hive is not
• A relational database
Features of Hive
• It allows different storage types such as plain text, RCFile, and HBase.
• It supports user-defined functions (UDFs), where users can provide their own functionality.
Limitations of Hive
HBase
• HBase is a data model that is like Google's Bigtable. It is an open-source, distributed
database developed by the Apache Software Foundation and written in Java.
• HBase is an essential part of our Hadoop ecosystem. HBase runs on top of HDFS
(Hadoop Distributed File System). It can store massive amounts of data from
terabytes to petabytes. It is column oriented and horizontally scalable.
• HBase is a data model that is like Google's Bigtable, designed to provide
quick random access to huge amounts of structured data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row. The
table schema defines only column families, which are the key value pairs. A table
has multiple column families, and each column family can have any number of
columns. Subsequent column values are stored contiguously on the disk. Each cell
value of the table has a timestamp. In short, in HBase, a table is a collection of rows, a row
is a collection of column families, a column family is a collection of columns, and a column
is a collection of key-value pairs.
Column-oriented databases are those that store data tables as sections of columns of data
rather than as rows of data; in short, they have column families.
HDFS provides only sequential access of data, whereas HBase internally uses hash tables
and provides random access; it stores the data in indexed HDFS files for faster lookups.
HBase and RDBMS
• HBase is schema-less; it doesn't have the concept of a fixed-columns schema and defines
only column families. An RDBMS is governed by its schema, which describes the whole
structure of the tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin and built for
small tables, and is hard to scale.
• There are no transactions in HBase. An RDBMS is transactional.
• HBase has denormalized data. An RDBMS will have normalized data.
• HBase is good for semi-structured as well as structured data. An RDBMS is good for
structured data.
MapR
MapR was a data platform that provided a distribution of Apache Hadoop along with
additional tools and capabilities for big data processing and analytics. It aimed to
simplify the management, storage, and processing of large volumes of data across
various environments, including on-premises and cloud.
• Sharding is a very important concept that helps the system to keep data in
different resources according to the sharding process. The word “Shard”
means “a small part of a whole“.
• Horizontal scaling allows for near-limitless scalability to handle big data and intense
workloads.
• In contrast, vertical scaling refers to increasing the power of a single machine or single
server through a more powerful CPU, increased RAM, or increased storage capacity.
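As a minimal sketch of the idea, the following R snippet routes a record to one of several shards using a hash of its key; the key format and shard count are hypothetical illustration values, not a production scheme.
n_shards <- 4                        # number of shard servers (assumed)
shard_of <- function(key, n = n_shards) {
  # hash the key by summing its character codes, then take it modulo n
  sum(utf8ToInt(key)) %% n + 1
}
shard_of("customer_1001")            # shard (1..4) that stores this record
shard_of("customer_1002")            # a different key may land on another shard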
Sharding (Contd…)
Why Sharding ?
• NoSQL databases (aka 'not only SQL') are non-tabular databases and store
data differently than relational tables. NoSQL databases come in a variety of
types based on their data model. The main types are document, key-value, wide-
column, and graph. They provide flexible schemas and scale easily with large
amounts of data and high user loads.
• NoSQL databases emerged in the late 2000s as the cost of storage dramatically
decreased. Gone were the days of needing to create a complex, difficult-to-
manage data model to avoid data duplication. Developers (rather than storage)
were becoming the primary cost of software development, so NoSQL databases
optimised for developer productivity.
NoSQL Database Features
Each NoSQL database has its own unique features. At a high level, many
NoSQL databases have the following features:
• Flexible schemas
• Horizontal scaling
Disadvantages of NoSQL Databases
4. GUI is not available: GUI-mode tools to access the database are not flexibly
available in the market.
5. Backup: Backup is a great weak point for some NoSQL databases like
MongoDB. MongoDB has no approach for the backup of data in a consistent
manner.
6. Large document size: Some database systems like MongoDB and CouchDB
store data in JSON format. This means that documents are quite large (Big Data,
network bandwidth, speed), and having descriptive key names actually hurts since
they increase the document size.
S3 (Simple Storage Service)
• S3 is a safe place to store files. It is object-based storage, i.e., you can
store images, Word files, PDF files, etc. The files stored in S3 can
be from 0 bytes to 5 TB. It has unlimited storage, which means that you can store
as much data as you want. Files are stored in buckets. A bucket is like a folder
available in S3 that stores the files.
1. Create Buckets: First, we create a bucket and provide a name for it.
Buckets are the containers in S3 that store the data. Buckets must have a
unique name to generate a unique DNS address.
3. Download data: You can also download your data from a bucket and can also give
permission to others to download the same data. You can download the data at any time
whenever you want.
4. Permissions: You can also grant or deny access to others who want to download or upload
the data from your Amazon S3 bucket. Authentication mechanism keeps the data secure from
unauthorized access.
5. Standard interfaces: S3 is used with the standard interfaces REST and SOAP interfaces
which are designed in such a way that they can work with any development toolkit.
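As a rough illustration only, the third-party aws.s3 R package can drive these operations from R. This sketch assumes the package is installed and AWS credentials are configured; the bucket and file names are made up.
library(aws.s3)
put_bucket("my-example-bucket")                 # create a bucket
put_object(file = "data.txt",                   # upload a local file as an object
           object = "reports/data.txt",
           bucket = "my-example-bucket")
save_object(object = "reports/data.txt",        # download the object back to disk
            bucket = "my-example-bucket",
            file = "data_copy.txt")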
2. Objects
➤ An object consists of some default metadata such as date last modified, and
standard HTTP metadata, such as Content type. Custom metadata can also be
specified at the time of storing an object.
3. Key
4. Regions
You can choose a geographical region in which you want to store the buckets
that you have created.
A region is chosen in such a way that it optimises latency, minimises costs,
or addresses regulatory requirements.
Objects will not leave the region unless you explicitly transfer the objects to
another region.
Hadoop Distributed File System (HDFS)
• Hadoop comes with a distributed file system called HDFS. In HDFS, data is
distributed over several machines and replicated to ensure durability against
failure and high availability for parallel applications. It is cost effective as it uses
commodity hardware. It involves the concepts of blocks, data nodes and name
node.
Where to use HDFS
2. Streaming Data Access: The time to read the whole dataset is more important
than the latency in reading the first record. HDFS is built on a write-once,
read-many-times pattern.
Where not to use HDFS
1. Low Latency data access: Applications that require very little time to access
the first data should not use HDFS, as it gives importance to the whole dataset
rather than the time to fetch the first record.
2. Lots of Small Files: The name node contains the metadata of files in
memory and if the files are small it takes a lot of memory for name node's
memory which is not feasible.
3. Multiple Writes: It should not be used when we must write multiple times.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are
broken into block-sized chunks, which are stored as independent units. Unlike a
regular file system, if a file in HDFS is smaller than the block size, it does not
occupy the full block size; i.e., a 5 MB file stored in HDFS with a block size of
128 MB takes only 5 MB of space. The HDFS block size is large just to minimise
the cost of seeks.
HDFS Concepts (Contd…)
2. Name Node: HDFS works in master-worker pattern where the name node acts
as master. Name Node is controller and manager of HDFS as it knows the status
and the metadata of all the files in HDFS; the metadata information being file
permission, names and location of each block. The metadata are small, so it is
stored in the memory of name node, allowing faster access to data. Moreover, the
HDFS cluster is accessed by multiple clients concurrently, so all this information
is handled by a single machine. The file system operations like opening, closing,
renaming etc. are executed by it.
HDFS Concepts (Contd…)
3. Data Node: Data nodes store and retrieve blocks when they are told to by the
client or the name node. They report back to the name node periodically with a
list of the blocks that they are storing. The data node, being commodity hardware,
also does the work of block creation, deletion and replication as instructed by the
name node.
HDFS Concepts (Contd…)
Starting HDFS
• HDFS should be formatted initially and then started in distributed
mode. The commands are given below.
To format: $ hadoop namenode -format
To start: $ start-dfs.sh
HDFS Basic File Operations
1. Putting data to HDFS from local file system
• First, create a folder in HDFS where data can be put from the local file system.
$ hadoop fs -mkdir /user/test
• Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test:
$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
• Compare the copied file with the original using md5:
$ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
Recursive deleting
$ hadoop fs -rmr <arg>
Features of HDFS
• Replication: Due to some unfavourable conditions, the node containing the data
may be lost. So, to overcome such problems, HDFS always maintains a copy
of the data on a different machine.
• Fault tolerance: In HDFS, fault tolerance signifies the robustness of the
system in the event of failure. HDFS is highly fault-tolerant: if any
machine fails, the other machine containing the copy of that data automatically
becomes active.
Features of HDFS (Contd…)
• Distributed data storage: This is one of the most important features of HDFS
that makes Hadoop very powerful. Here, data is divided into multiple blocks and
stored into nodes.
• Portable: HDFS is designed in such a way that it can easily be ported from one
platform to another.
Goals of HDFS
• Handling the hardware failure: The HDFS contains multiple server machines.
Anyhow, if any machine fails, the HDFS goal is to recover it quickly.
• Streaming data access: The HDFS applications usually run on the general-
purpose file system. This application requires streaming access to their datasets.
• Coherence Model: Applications that run on HDFS are required to follow the
write-once, read-many approach. So, a file, once created, need not be
changed; however, it can be appended and truncated.
Visual Data Analysis
• Data visualisation converts large and small datasets into visuals, which are easy for
humans to understand and process.
• Data visualisation techniques use charts and graphs to visualise large amounts of
complex data. Visualisation provides a quick and easy way to convey concepts and to
summarise and present large data in easy-to-understand, straightforward displays,
which enables readers to gain insightful information.
• Data visualisation is one of the steps of the data science and data analytics process, which
states that after data has been collected, processed and modeled, it must be visualized for
conclusions to be made.
• Data visualisation is also an element of the broader data presentation architecture (DPA)
discipline, which aims to identify, locate, manipulate, format and deliver data in the most
efficient way possible.
Features of Data Visualisation
• Decision-making Ability.
• Integration Capability.
1. Identifying the purpose of creating a chart is necessary as this helps define the
structure of the process.
3. Selecting the right type of chart is very crucial as this defines the overall
functionality of the chart.
5. Choosing the correct type of colour, shape, and size is essential for
representing the chart.
Identify the Purpose of the Visualisation (Contd…)
Challenges of Data Visualisation
• Big data consists of large-volume, complex datasets. Such data cannot be visualised
with traditional methods, as traditional data visualisation methods have many
limitations.
1. Perceptual Scalability: Human eyes cannot extract all relevant information
from a large volume of data. Sometimes even the desktop screen has its limitations
if the dataset is large; too many visualisations cannot always fit on a single
screen.
2. Real-time Scalability: It is always expected that all information should be real-
time information, but this is hardly possible, as processing the dataset needs time.
3. Interactive Scalability: Interactive data visualisation helps to understand what is
inside the datasets, but as big data volume increases exponentially, visualising
the datasets takes a long time. The challenge is that the system may sometimes
freeze or crash while trying to visualise the datasets.
Data Visualisation Techniques
1. Line Charts: Line Charts involve creating a graph where data is represented
as a line or as a set of data points joined by a line.
Data Visualisation Techniques (Contd…)
2. Area Chart: Area chart structure is a filled-in area that requires at least two
groups of data along an axis.
Data Visualisation Techniques (Contd…)
3. Pie Charts: Pie charts represent a graph in the shape of a circle. The whole
chart is divided into subparts, which look like a sliced pie.
Data Visualisation Techniques (Contd…)
4. Donut Chart: Doughnut charts are pie charts that do not contain any data
inside the circle.
Data Visualisation Techniques (Contd…)
5. Drill Down Pie Charts: Drill down pie charts are used for representing
detailed description for a particular category.
Data Visualisation Techniques (Contd…)
6. Bar Charts: A bar chart is the type of chart in which data is represented in
vertical series and used to compare trends over time.
Data Visualisation Techniques (Contd…)
7. Scatter and Bubble Charts: These create a chart in which the position and size of
bubbles represent data. They are used to show similarities among types of values,
mainly when you have multiple data objects and you need to see the general relations.
Data Visualisation Techniques (Contd…)
8. 3D Charts: A 3D chart can be rotated and viewed from different angles,
which helps in representing data.
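A quick sketch of a few of the above chart types using base R plotting functions, with made-up sample data.
sales  <- c(12, 19, 15, 24, 28)
months <- c("Jan", "Feb", "Mar", "Apr", "May")
plot(sales, type = "l", xaxt = "n", xlab = "Month", ylab = "Sales",
     main = "Line chart")                                  # line chart
axis(1, at = 1:5, labels = months)
barplot(sales, names.arg = months, main = "Bar chart")     # bar chart
pie(sales, labels = months, main = "Pie chart")            # pie chart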
Data Visualisation Process Flow and Stages
Every dataset has its own needs when it comes to illustrating the data. Below are the stages
and process flow for data visualisation.
1. Acquire: Obtaining the correct data type is a crucial part as the data can be collected
from various sources and can be unstructured.
2. Parse: Provide some structure for the data's meaning by restructuring the received
data into different categories, which helps better visualise and understand data.
3. Filter: Filtering out the data that cannot serve the purpose is essential as filtering out
will remove the unnecessary data, further enhancing the chart visualisation.
Data Visualisation Process Flow and Stages (Contd…)
4. Mining: Applying methods from statistics or data mining to discern patterns
and place the data in a mathematical context. Data visualisation helps viewers
seek insights that cannot be gained from raw data or statistics.
5. Represent: One of the most significant challenges for users is deciding which
chart suits best and represents the right information. The data exploration
capability is necessary for statisticians, as this reduces the need for repeated
sampling to determine which data is relevant for each model.
6.Refine: Refining and improving the essential representation helps in user
engagement.
7. Interact: Add methods for handling the data or managing what features are
visible.
Big Data Visualisation Tools
Nowadays, there are many data visualisation tools. Some of them are:
1. Google Chart: Google Chart is one of the easiest tools for visualisation.
With the help of google charts, you can analyse small datasets to complex
unstructured datasets. We can implement simple charts as well as complex
tree diagrams. Google Chart is available cross-platform as well.
2. Tableau: Tableau Desktop is a very easy-to-use big data visualisation
tool. Two more versions of Tableau are available: one is 'Tableau Server' and
the other is the cloud-based 'Tableau Online'. Here we can perform visualisation
operations by applying drag-and-drop methods for creating visual diagrams.
In Tableau, we can create dashboards very efficiently.
3. Microsoft Power BI: This tool is mainly used for business analysis.
Microsoft Power BI can be run from desktops, smartphones, and even tablets.
This tool also provides analysis results very quickly.
Big Data Visualisation Tools (Contd…)
5. Datawrapper: Datawrapper is a simple tool. Even non-technical persons can
use Datawrapper. Data represented in a table format or as responsive graphs
like a bar chart, line chart, or map can be drawn quickly in Datawrapper.
Use Cases of Big Data Visualisation Tools
2. Fraud Detection: Fraud detection is a famous use case of big data. With
the help of visualisation tools, after analysing data, a message can be
generated to others, and they will be careful about such fraud incidents.
Use Cases of Big Data Visualisation Tools (Contd…)
Interactive Data Visualisation
• Interactive data visualisation refers to the use of modern data analysis software that
enables users to directly manipulate and explore graphical representations of data,
performing direct actions to modify elements on a graphical plot.
• Data visualisation uses visual aids to help analysts efficiently and effectively
understand the significance of data. Interactive data visualisation software improves
upon this concept by incorporating interaction tools that facilitate the modification
of the parameters of a data visualisation, enabling the user to see more detail, create
new insights, generate compelling questions, and capture the full value of the data.
Interactive Data Visualisation Techniques
Deciding what the best interactive data visualisation will be for your project depends on
your end goal and the data available. Some common data visualisation interactions that
will help users explore their data visualizations include:
• Next, it's time to user test in order to refine compatibility, functionality, security, the
user interface, and performance. Now you are ready to launch to your target
audience. Methods for rapid updates should be built in so that your team can stay up
to date with your interactive data visualisation.
• Some popular libraries for creating your own interactive data visualizations include
Altair, Bokeh, Celluloid, Matplotlib, interact, Plotly, Pygal, and Seaborn. Libraries
are available for Python, Jupyter, JavaScript, and R interactive data visualizations.
Scott Murray's Interactive Data Visualisation for the Web is one of the most popular
educational resources for learning how to create interactive data visualizations.
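As a small sketch, the plotly package mentioned later in this unit can turn an ordinary R scatter plot into an interactive one with hover, zoom and pan. It assumes the package is installed; the built-in iris data is used only as an example.
library(plotly)
plot_ly(data = iris,
        x = ~Sepal.Length, y = ~Petal.Length,
        color = ~Species,                    # one trace per species, with a legend
        type = "scatter", mode = "markers")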
Benefits of Interactive Data Visualizations
3. Useful Data Storytelling: Humans best understand a data story when its
development over time is presented in a clear, linear fashion. A visual data story
in which users can zoom in and out, highlight relevant information, filter, and
change the parameters promotes better understanding of the data by presenting
multiple viewpoints of the data.
4. Simplify Complex Data: A large dataset with a complex data story may
present itself visually as a chaotic, intertwined hairball. Incorporating filtering
and zooming controls can help untangle and make these messes of data more
manageable and can help users glean better insights.
Introduction to R
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand, and is currently developed by the R Development Core Team.
• R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems like Linux, Windows and Mac.
• This programming language was named R based on the first letter of the first names of the
two R authors (Robert Gentleman and Ross Ihaka). R is not only trusted by academia; many
large companies also use the R programming language, including Uber, Google, Airbnb,
Facebook and so on.
R is used for
• Statistical inference
• Data analysis
The following are some of the main benefits realised by companies employing R
in their analytics programs:
2. Providing Deeper, More Accurate Insights: Today, most successful companies are data
driven and therefore data analytics affects almost every area of business. And while there are
a whole host of powerful data analytics tools, R can help create powerful models to analyse
large amounts of data. Analytics and statistical engines using R provide deeper, more accurate
insights for the business. R can be used to develop very specific, in-depth analyses.
3. Leveraging Big Data: R can help with querying big data and is used by many industry
leaders to leverage Big Data across the business. With R analytics, organisations can surface
new insights in their large datasets and make sense of their data. R can handle these big
datasets and is arguably as easy if not easier for most analysts to use as any of the other
analytics tools available today.
Benefits of R Analytics (Contd…)
• RStudio is an example of a code editor that interfaces with R on the Windows, Mac
OS, and Linux platforms.
• Perhaps the most stable, full-blown GUI is R Commander, which can also run
under Windows, Linux, and Mac OS.
Graphic User Interfaces (Contd…)
• It consists of a language together with a run-time environment with a debugger, graphics, access
to system functions, and scripting.
• R offers a wide variety of statistical and graphical techniques, including time series analysis,
linear and nonlinear modeling, classical statistical tests, classification, clustering, and more.
Combined with a large collection of intermediate tools for data analysis, good data handling and
storage, and a general matrix calculation toolbox, R offers a coherent and well-developed system which is
highly extensible. Many statisticians and data scientists use R from the command line.
R Graphics
• Graphics play an important role in carrying out the important features of the
data. Graphics are used to examine marginal distributions, relationships
between variables, and summary of very large data. It is a very important
complement for many statistical and computational techniques.
Standard Graphics
• Scatterplots
• Pie charts
• Boxplots
• Barplots, etc.
The above graphs can typically be produced with a single function call.
Graphics Devices
• A graphics device is something where we can make a plot appear. It can be a
window on your computer (screen device), a PDF file (file device), a Scalable
Vector Graphics (SVG) file (file device), or a PNG or JPEG file (file device).
• There are some of the following points which are essential to understand:
1. The functions of graphics devices produce output, which depends on the active
graphics device.
3. R graphical devices such as the PDF device, the JPEG device, etc. are used.
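A minimal sketch of switching between devices: the same plot is sent first to a PDF file device and then to a PNG file device. The file names are arbitrary examples.
pdf("scatter.pdf")                               # open a PDF file device
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
dev.off()                                        # close the device so the file is written
png("scatter.png", width = 600, height = 400)    # the same plot as a PNG file
plot(mtcars$wt, mtcars$mpg)
dev.off()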
The Basics of the Grammar of Graphic
• There are some key elements of a statistical graphic. These elements are the
basics of the grammar of graphics.
The Basics of the Grammar of Graphic (Contd…)
1. Data: Data is the most crucial thing which is processed and generates an output.
2. Aesthetic Mappings: Aesthetic mappings are one of the most important
elements of a statistical graphic. It controls the relation between graphics variables
and data variables. In a scatter plot, it also helps to map the temperature variable of
a dataset into the X variable. In graphics, it helps to map the species of a plant into
the colour of dots.
3. Geometric Objects: Geometric objects are used to express each observation by
a point using the aesthetic mappings. It maps two variables in the dataset into the x,
y variables of the plot.
4. Statistical Transformations: Statistical transformations allow us to calculate
the statistical analysis of the data in the plot. The statistical transformation uses the
data and approximates it with the help of a regression line having x, y coordinates,
and counts occurrences of certain values.
The Basics of the Grammar of Graphic (Contd…)
5. Scales: It is used to map the data values into values present in the coordinate
system of the graphics device.
6. Coordinate System: It describes how the data coordinates are mapped onto the
plane of the graphic, for example Cartesian or polar coordinates.
7. Faceting: Faceting is used to split the data into subgroups and draw sub-
graphs for each group.
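A short ggplot2 sketch that maps each grammar element above onto code; it assumes ggplot2 is installed, and the built-in iris data stands in for a real dataset.
library(ggplot2)
ggplot(data = iris,                                    # data
       aes(x = Sepal.Length, y = Petal.Length,
           colour = Species)) +                        # aesthetic mappings
  geom_point() +                                       # geometric objects
  geom_smooth(method = "lm", se = FALSE) +             # statistical transformation
  scale_x_continuous(name = "Sepal length (cm)") +     # scales
  facet_wrap(~ Species)                                # faceting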
Advantages of Data Visualisation in R
3. Location: Apps utilising features such as geographic maps and GIS can be
particularly relevant to the wider business when location is a very relevant factor.
We can use maps to show business insights from various locations, also considering
the seriousness of the issues, the reasons behind them, and the working groups to
address them.
Disadvantages of Data Visualisation in R
2. Distraction: At times, data visualisation apps create highly complex and fancy
graphics-rich reports and charts, which may entice users to focus more on the form than the
function. If visual appeal is prioritised over function, the overall value of the graphic
representation will be minimal. In resource-constrained settings, it is important to understand
how resources can best be used and not to get caught up in the graphics trend without a clear purpose.
Graphical User Interfaces for R
Importing Data in R
• Importing data in R programming means that we can read data from external files,
write data to external files, and can access those files from outside the R
environment. File formats like CSV, XML, xlsx, JSON, and web data can be imported
into the R environment to read the data and perform data analysis, and also the data
present in the R environment can be stored in external files in the same file formats.
• The easiest form of data to import into R is a simple text file, and this will often be
acceptable for problems of small or medium scale. The primary function to import from a
text file is scan, and this underlies most of the more convenient functions discussed in
Spreadsheet- like data.
Data Import and Export (Contd…)
• However, all statistical consultants are familiar with being presented by a client
with a memory stick (formerly, a floppy disc or CD-R) of data in some
proprietary binary format, for example 'an Excel spreadsheet' or 'an SPSS file'.
Often the simplest thing to do is to use the originating application to export the
data as a text file (and statistical consultants will have copies of the most
common applications on their computers for that purpose). However, this is not
always possible, and Importing from other statistical systems discusses what
facilities are available to access such files directly from R.
• For Excel spreadsheets, the available methods are summarised in Reading Excel
spreadsheets. In a few cases, data have been stored in a binary form for
compactness and speed of access. One application of this that we have seen
several times is imaging data, which is normally stored as a stream of bytes as
represented in memory, possibly preceded by a header. Such data formats are
discussed in Binary files and Binary connections.
Data Import and Export (Contd…)
Reading CSV Files
• CSV (Comma Separated Values) is a text file in which the values in columns are
separated by a comma. For importing data in the R programming environment,
we have to set our working directory with the setwd() function.
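A minimal import sketch; the directory and file name are hypothetical examples.
setwd("C:/Users/student/data")            # set the working directory
emp <- read.csv("employees.csv",          # read the CSV file into a data frame
                header = TRUE,
                stringsAsFactors = FALSE)
head(emp)                                 # inspect the first few rows
str(emp)                                  # inspect the column types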
1. Exporting Data into Text/CSV: Exporting data into a text or a CSV file, is
the most popular and indeed common way of data export. Not only because
most of the software supports the option to export data into Text or CSV but
also because these files are supported by almost every software/programming
language that exists.
• There are two ways of exporting data into text files through R. One is using the
base R functions, and another one is using the functions from the readr package
to export data into text/CSV format.
Exporting Data in R (Contd…)
2. Using Built-in Functions: There is a popular built-in R function named write.table() which
can export the data into text files from the R workspace. The function has two special
cases, namely write.csv() and write.csv2(), of which the first helps to export the
data into CSV format and the second is a variant of write.table() with different default
delimiters.
3. Using Functions from the readr Package: The functions under the readr package are similar to
those available in base R, with a minor difference in their look (write.csv() from
base R and write_csv() from readr do the same thing). Besides, the functions in the
readr package use a path = argument instead of file = to specify where the file needs
to be exported. The functions from the readr package exclude the row names by default.
Exporting Data in R (Contd…)
4. Exporting Data into Excel: To export data into Excel from the R workspace, the best
option is the writexl package. This package allows you to export data as an
Excel file in xlsx format. It may look outdated at this moment, but the functions do their
task with precision. Besides, newer Excel versions remain compatible with the older ones.
5. Exporting Data into R Objects: There might be situations where you want to share
data from R as objects with your colleagues on different systems, so that they can use
it right away in their R workspace. These objects are of two types: .rda/.RData and .rds.
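A hedged sketch of these last two routes, assuming the writexl package is installed; emp and the file names are again just examples.
library(writexl)
write_xlsx(emp, "emp.xlsx")               # export a data frame to an .xlsx file
saveRDS(emp, file = "emp.rds")            # a single object -> .rds
emp_again <- readRDS("emp.rds")           # read it back
save(emp, file = "emp.RData")             # one or more objects -> .RData
load("emp.RData")                         # restores emp into the workspace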
Exporting Data from Scripts in R Programming
• Example: Let's consider contact information, where name, address, email, etc. are the
attributes.
• Observed values for a given attribute are termed observations. The type of an
attribute is determined by its set of feasible values: nominal, binary,
ordinal, or numeric.
Types of Attributes
1. Nominal Attributes: Nominal means "relating to names". The values of a nominal
attribute are symbols or names of things. Each value represents some kind of category,
code or state, and so nominal attributes are also referred to as categorical.
Example: Suppose that skin colour and education status are two attributes describing
person objects. In our implementation, possible values for skin colour are dark, white and
brown. The attribute education status can contain the values undergraduate,
postgraduate and matriculate. Both skin colour and education status are nominal attributes.
2. Binary Attributes: A binary attribute is a category of nominal attributes that
contains only two classes: 0 or 1, where 0 often tells that the attribute is not present and
1 tells that it is present. Binary attributes are mentioned as Boolean if the two conditions
agree to true and false.
Example: Given the attribute drinker describing a patient, 1 specifies that the patient
drinks, while 0 specifies that the patient does not. Similarly, suppose the patient undergoes
a medical test that has two practicable outcomes.
Types of Attributes (Contd…)
3. Numeric Attributes: A numeric attribute is quantitative, i.e., a measurable quantity,
represented in integer or real values. Example: Height, weight, and temperature have real
values. Real values can only be represented and measured using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Examples of Data Types
Descriptive Statistics in R
• In Descriptive analysis, we are describing our data with the help of various
representative methods like using charts, graphs, tables, excel files, etc.
• Most of the time it is performed on small datasets and this analysis helps us a
lot to predict some future trends based on the current findings. Some
measures that are used to describe a dataset are measures of central tendency
and measures of variability or dispersion.
Process of Descriptive Analysis
• Descriptive Analysis helps us to understand our data and is a very important part of
Machine Learning. This is due to Machine Learning being all about making predictions.
On the other hand, statistics is all about drawing conclusions from data, which is a
necessary initial step for Machine Learning. Let's do this descriptive analysis in R.
Descriptive Analysis in R
Descriptive analysis consists of describing the data simply, using some summary statistics
and graphics. Here, we'll describe how to compute summary statistics using R software.
Before doing any computation, we first need to prepare our data and save it in an external
.txt or .csv file; it is best practice to save the file in the current working directory.
R Functions for Computing Descriptive Analysis
R Functions for Computing Descriptive Analysis (Contd…)
1. Mean: It is the sum of all observations divided by the total number of observations:
Mean (x̄) = (x₁ + x₂ + … + xₙ) / n = Σ xᵢ / n
2. Median: It is the middle value of the dataset. It splits the data into two
halves. If the number of elements in the dataset is odd, then the center element
is median and if it is even then the median would be the average of two
central elements.
R Functions for Computing Descriptive Analysis (Contd…)
3. Mode: It is the value that has the highest frequency in the given dataset. The
dataset may have no mode if the frequency of all data points is the same. Also, we
can have more than one mode if we encounter two or more data points having the
same frequency.
4. Range: The range describes the difference between the largest and smallest
data point in our dataset. The bigger the range, the more is the spread of data and
vice versa.
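A small sketch of these measures in R on a made-up vector; note that R has no built-in function for the statistical mode, so a tiny helper is defined here.
x <- c(12, 15, 15, 18, 20, 22, 22, 22, 30)
mean(x)                                   # mean
median(x)                                 # median
stat_mode <- function(v) {                # mode: most frequent value(s)
  freq <- table(v)
  as.numeric(names(freq)[freq == max(freq)])
}
stat_mode(x)                              # 22
range(x)                                  # smallest and largest values
diff(range(x))                            # range as a single number (max - min)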
• Finding out the important variables that can be used in our problem.
• Searching for the answers by using visualisation, transformation, and modeling of our data.
• Using the lessons that we learn to refine our set of questions or to generate a new set of questions.
Exploratory Data Analysis in R
To perform EDA in R using descriptive statistics, we will divide the functions into the
following categories:
• Measures of dispersion
• Correlation
Graphical Method in EDA
Since we have already checked our data for missing values, blatant errors, and
typos, we can now examine our data graphically to perform EDA. We will see the
graphical representation under the following categories:
• Distributions
• The practice can also help businesses identify which factors affect customer
behaviour; pinpoint areas that need to be improved or need more attention;
make data more memorable for stakeholders; understand when and where
to place specific products; and predict sales volumes.
Benefits of Data Visualisation
• The ability to absorb information quickly, improve insights and make faster
decisions;
• An increased understanding of the next steps that must be taken to improve the
organisation;
• An improved ability to maintain the audience's interest with information they can
understand;
• An easy distribution of information that increases the opportunity to share insights
with everyone involved;
• An eliminated need for data scientists, since data is more accessible and
understandable; and
• An increased ability to act on findings quickly and, therefore, achieve success with
greater speed and fewer mistakes.
R Data Visualisation
1. plotly: The plotly package provides online interactive and quality graphs. This package
extends upon the JavaScript library plotly.js.
2. ggplot2: R allows us to create graphics declaratively. R provides the ggplot2 package for
this purpose. This package is famous for its elegant and quality graphs, which set it apart
from other visualisation packages.
3. tidyquant: The tidyquant package is a financial package that is used for carrying out
quantitative financial analysis. It fits into the tidyverse as a financial package used for
importing, analysing, and visualising data.
R Data Visualisation (Contd…)
4. taucharts: Data plays an important role in taucharts. The library provides a declarative
interface for rapid mapping of data fields to visual properties.
5. ggiraph: It is a tool that allows us to create dynamic ggplot graphs. This package allows
us to add tooltips, JavaScript actions, and animations to the graphics.
7. googleVis: googleVis provides an interface between R and Google's charts tools. With
the help of this package, we can create web pages with interactive charts based on R data
frames.
R Data Visualisation (Contd…)
8. RColorBrewer: This package provides colour schemes for maps and other
graphics, which were designed by Cynthia Brewer.
10. shiny: R allows us to develop interactive and aesthetically pleasing web apps
by providing a shiny package. This package provides various extensions with
HTML widgets, CSS, and JavaScript.
R Data Visualisation (Contd…)
Data Visualisation in R
Data visualisation is the technique used to deliver insights in data using visual cues such as
graphs, charts, maps, and many others. It is useful as it helps in intuitive and easy
understanding of large quantities of data, and thereby in making better decisions about it.
• R is a language that is designed for statistical computing, graphical data analysis, and
scientific research. It is usually preferred for data visualisation as it offers flexibility and
requires minimal coding.
Types of Data Visualisations
1. Bar Plot : There are two types of bar plots-horizontal and vertical which
represent data points as horizontal or vertical bars of certain lengths
proportional to the value of the data item. They are generally used for
continuous and categorical variable plotting. By setting the horiz parameter to
true and false, we can get horizontal and vertical bar plots, respectively.
2. Histogram: It is like a bar chart as it uses bars of varying height to represent
data distribution. However, in a histogram values are grouped into consecutive
intervals called bins. In a Histogram, continuous values are grouped and
displayed in these bins whose size can be varied.
3. Box Plot: The statistical summary of the given data is presented graphically
using a boxplot. A boxplot depicts information like the minimum and
maximum data point, the median value, first and third quartile, and
interquartile range.
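A brief sketch of these three plot types using the built-in mtcars data; the column choices are illustrative only.
barplot(table(mtcars$cyl), horiz = TRUE,
        main = "Bar plot", xlab = "Count")            # horiz = TRUE gives horizontal bars
hist(mtcars$mpg, breaks = 8,
     main = "Histogram", xlab = "Miles per gallon")   # values grouped into bins
boxplot(mpg ~ cyl, data = mtcars,
        main = "Box plot", xlab = "Cylinders", ylab = "Miles per gallon")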
Analytics for Unstructured Data
• Unstructured data is data that doesn't have a fixed form or structure. Images,
videos, audio files, text files, social media data, geospatial data, data from
IoT devices, and surveillance data are examples of unstructured data. About
80%-90% of data is unstructured. Businesses process and analyse unstructured
data for different purposes, like improving operations and increasing revenue.
Unstructured data analysis is complex and requires specialised techniques,
unlike structured data, which is straightforward to store and analyse.
• Here is a quick glance at all the unstructured data analysis techniques and tips
as:
1. Keep the business objective(s) in mind
2. Define metadata for faster data access
3. Choose the right analytics techniques
4. Qualitative data analysis techniques
Analytics for Unstructured Data (Contd…)
Unstructured data analysis has the potential to generate huge business insights. However,
traditional storage and analysis techniques are not sufficient to handle unstructured data.
Here are some of the challenges that companies face in analysing unstructured data:
3. Data protection
5. Data migration
6. Cognitive bias
Unstructured Data Analytics Tools
• Unstructured data analytics tools use machine learning to gather and analyse data that has
no pre-defined framework-like human language. Natural language processing (NLP)
allows software to understand and analyse text for deep insights, much as a human would.
• Unstructured data analysis can help your business answer more than just the "What is
happening?" of numbers and statistics and go into qualitative results to understand "Why
is this happening?"
• MonkeyLearn is a SaaS platform with powerful text analysis tools to pull real-world and
real-time insights from your unstructured information, whether it's public data from the
internet, communications between your company and your customers, or almost any other
source.
Unstructured Data Analytics Tools (Contd…)
• Among the most common and most useful tools for unstructured data analysis are:
2. Keyword extraction, to pull the most used and most important keywords from
text: find recurring themes and summarise whole pages of text.
Difference between Data Visualization and Data Analytics
• Definition: Data visualization is the graphical representation of information and data in a
pictorial or graphical format. Data analytics is the process of analyzing data sets in order to
make decisions about the information they contain, increasingly with specialized software and systems.
• Industries: Data visualization technologies and techniques are widely used in finance,
banking, healthcare, retailing, etc. Data analytics technologies and techniques are widely used
in commercial, finance, healthcare, crime detection, travel agencies, etc.
• Platforms: Data visualization: big data processing, service management dashboards, analysis
and design. Data analytics: big data processing, data mining, analysis and design.
Difference between Data Science and Data Analytics
• Use of Machine Learning: Data Science makes use of machine learning algorithms to get
insights. Data Analytics does not use machine learning to get insight from the data.
• Goals: Data Science deals with explorations and new innovations. Data Analytics makes use
of existing resources.
• Data Type: Data Science mostly deals with unstructured data. Data Analytics deals with
structured data.
• Statistical Skills: Statistical skills are necessary in the field of Data Science. Statistical skills
are of minimal or no use in data analytics.
Differentiation between NoSQL and RDBMS
• Structure: NoSQL has a flexible schema with key-value, document, graph, or wide-column
formats. An RDBMS has a rigid schema, with data stored in tables with rows and columns.
• Query Language: NoSQL has no standard language (e.g., JSON, CQL, APIs). An RDBMS
uses SQL (Structured Query Language).
• Scalability: NoSQL is horizontally scalable (adding more servers). An RDBMS is vertically
scalable (upgrading hardware).
• Performance: NoSQL is optimized for large-scale, high-throughput operations. An RDBMS
is efficient for complex queries and transactions.
• Data Consistency: NoSQL typically offers eventual consistency (CAP theorem trade-offs).
An RDBMS offers strong consistency due to ACID compliance.
• Data Integration: Aggregating data from various sources into a unified format.
• Query Optimization: Designed for efficient data retrieval using SQL and OLAP
(Online Analytical Processing).
Data mining task primitives are the basic building blocks or operations used in
the data mining process to define and execute tasks for extracting patterns,
insights, or knowledge from data. These primitives help specify what type of
data mining task is being performed and the parameters involved.
Purpose of Task Primitives
• These primitives serve as a framework to guide the data mining process,
ensuring that:
• Goals are clearly defined.
• Techniques are appropriately applied.
• Results are interpretable and actionable.
Cloud Computing
Cloud Computing refers to the delivery of computing services—including
servers, storage, databases, networking, software, analytics, and intelligence—
over the internet ("the cloud"). It allows users to access and use resources on
demand without needing to own or manage physical hardware or infrastructure.
Advantages of Cloud Computing