UNIVERSITY INSTITUTE OF ENGINEERING
COMPUTER SCIENCE ENGINEERING
Data Visualization (CSH-461)
Data extraction is the first and perhaps most important step of the Extract/Transform/Load (ETL) process. With properly extracted data, organizations can gain valuable insights, make informed decisions, and drive efficiency across their workflows.
Data extraction is crucial for almost every organization because many different sources generate large amounts of unstructured data. If the right data extraction techniques are not applied, organizations not only miss out on opportunities but also waste valuable time, money, and resources.
Data Extraction Techniques:
1. Association
2. Classification
3. Clustering
4. Regression
1. Association
The association technique extracts data based on the relationships and patterns between items in a dataset. It works by identifying frequently occurring combinations of items, and these relationships, in turn, reveal patterns in the data.
Furthermore, this method uses "support" and "confidence" parameters to identify patterns within the dataset and make extraction easier. The most frequent use cases for the association technique are invoice and receipt data extraction.
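To make the support and confidence idea concrete, here is a minimal Python sketch; the transactions and item names are hypothetical, not taken from any dataset mentioned above.

```python
# Hypothetical purchase transactions (e.g., items appearing together on receipts).
transactions = [
    {"pen", "notebook", "stapler"},
    {"pen", "notebook"},
    {"notebook", "stapler"},
    {"pen", "notebook", "eraser"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): support of the union divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"pen", "notebook"}))       # 0.75 -> the pair occurs in 3 of 4 transactions
print(confidence({"pen"}, {"notebook"}))  # 1.0  -> every transaction with a pen also has a notebook
```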
2. Classification
Classification-based data extraction techniques are among the most widely adopted and efficient methods of data extraction. In this technique, data is categorized into predefined classes or labels with the help of predictive algorithms. Models are created and trained on this labelled data to perform classification-based extraction.
A common use case for classification-based data extraction techniques would be in
managing digital mortgage or banking systems.
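As a minimal illustration (not the specific system the text refers to), the sketch below trains a scikit-learn classifier on a few invented, labelled examples and predicts the class of a new record.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled documents described by two simple numeric features:
# [number of currency symbols, number of account-number patterns]
X = [[5, 0], [6, 1], [0, 4], [1, 5], [4, 0], [0, 3]]
y = ["invoice", "invoice", "bank_statement", "bank_statement", "invoice", "bank_statement"]

# Train a predictive model on the labelled data, then classify a new document.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[5, 1]]))   # -> ['invoice']
```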
3. Clustering
Clustering data extraction techniques apply algorithms to group similar data points
into clusters based on their characteristics. This is an unsupervised learning technique
and does not require prior labelling of the data.
Clustering is often used as a prerequisite for other data extraction algorithms to function properly. The most common use case for clustering is extracting visual data from images or posts, where there can be many similarities and differences.
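A minimal clustering sketch with scikit-learn's KMeans is shown below; the two-dimensional points are made up and simply stand in for features extracted from visual data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D feature vectors, e.g. simple descriptors extracted from images.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
    [8.0, 8.2], [7.9, 7.8], [8.3, 8.1],   # another natural group
])

# Group similar points into two clusters; no labels are required (unsupervised).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster id assigned to each point
print(kmeans.cluster_centers_)  # centre of each cluster
```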
4. Regression
Each dataset consists of data with different variables. Regression data extraction
techniques are used to model relationships between one or more independent
variables and a dependent variable.
Regression-based data extraction works with continuous values that define the variables of the entities associated with the data. Most commonly, organizations use regression to identify and model the relationships between dependent and independent variables within their datasets.
Types of Data Extraction
Organizations use several types of data extraction, such as manual, traditional OCR-based, and web scraping. Each data extraction method applies one or more of the techniques discussed earlier.
2. Web Scraping
Web scraping refers to the extraction of data from a website. This data is then
exported and collected in a format more useful for the user, be it a spreadsheet or an
API. Although web scraping can be done manually, in most cases it is done with the
help of automated bots or crawlers as they can be less costly and work faster.
However, in most cases, web scraping is not a straightforward task. Websites come in many different formats and can present obstacles, such as CAPTCHAs, that scrapers must handle.
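A minimal scraping sketch using the widely used requests and BeautifulSoup libraries is shown below. The URL and CSS selector are placeholders; real sites may additionally require handling pagination, rate limits, CAPTCHAs, and their terms of service.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/"   # placeholder page; swap in the real target site

response = requests.get(URL, timeout=10)   # fetch the page
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Pull out whichever elements matter for the task, e.g. headings or table cells.
headings = [tag.get_text(strip=True) for tag in soup.select("h1")]
print(headings)
```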
3. OCR-based data extraction
Optical Character Recognition or OCR refers to the extraction of data from printed or
written text, scanned documents, or images containing text and converting it into
machine-readable format. OCR-based data extraction methods require little to no
manual intervention and have a wide variety of uses across industries.
OCR tools work by preprocessing the image or scanned document and then identifying
the individual character or symbol by using pattern matching or feature recognition.
With the help of deep learning, OCR tools today can read 97% of the text correctly
regardless of the font or size and can also extract data from unstructured documents.
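One common way to run OCR from Python is the pytesseract wrapper around the Tesseract engine. The sketch below is a minimal example; the file name is a placeholder and the Tesseract binary must be installed separately.

```python
from PIL import Image
import pytesseract

# Load a scanned document or photo containing text (placeholder file name).
image = Image.open("scanned_invoice.png")

# Tesseract segments the characters and returns the recognised text as a string.
text = pytesseract.image_to_string(image)

print(text)
```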
5. AI-enabled data extraction
AI-enabled data extraction is the most efficient way to extract data while reducing errors. It automates the entire extraction process, requiring little to no manual intervention, while also reducing the time and resources invested in the process.
AI-based document processing utilizes intelligent data interpretation to understand the
context of the data before extracting it. It also cleans up noisy data, removes
irrelevant information, and converts data into a suitable format. AI in data extraction
largely refers to the use of Machine Learning (ML), Natural Language Processing (NLP),
and Optical Character Recognition (OCR) technologies to extract and process the data.
6. API Integration
API integration is one of the most efficient methods of extracting and transferring
large amounts of data. An API enables fast and smooth extraction of data from
different types of data sources and consolidation of the extracted data in a centralized
system.
One of the biggest advantages of API is that the integration can be done between
almost any type of data system and the extracted data can be used for multiple
different activities such as analysis, generating insights, or creating reports.
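A typical API-based extraction issues authenticated HTTP requests and consolidates the JSON responses. In the sketch below the endpoint, token, and pagination scheme are hypothetical.

```python
import requests

BASE_URL = "https://api.example.com/v1/orders"    # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

def fetch_all_orders():
    """Page through the API and collect every record in one list."""
    records, page = [], 1
    while True:
        resp = requests.get(BASE_URL, headers=HEADERS,
                            params={"page": page}, timeout=10)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:          # empty page -> no more data
            break
        records.extend(batch)
        page += 1
    return records

orders = fetch_all_orders()
print(len(orders), "records extracted")
```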
7. Text pattern matching
Text pattern matching or text extraction refers to the finding and retrieving of specific
patterns within a given data set. A specific sequence of characters or patterns needs
to be predefined which will then be searched for within the provided data set.
This data extraction type is useful for validating data by finding specific keywords,
phrases, or patterns within a document.
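Pattern matching is usually expressed with regular expressions. The sketch below pulls e-mail addresses and invoice-style numbers out of a text snippet; the patterns and sample text are purely illustrative.

```python
import re

text = """Contact [email protected] about invoice INV-2024-0042,
or [email protected] regarding INV-2024-0107."""

# Predefined character patterns searched for within the data set.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
invoice_pattern = re.compile(r"INV-\d{4}-\d{4}")

print(email_pattern.findall(text))    # ['[email protected]', '[email protected]']
print(invoice_pattern.findall(text))  # ['INV-2024-0042', 'INV-2024-0107']
```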
8. Database querying
Database querying is the process of requesting and retrieving specific information or
data from a database management system (DBMS) using a query language. It allows
users to interact with databases to extract, manipulate, and analyse data based on
their specific needs.
Structured query language (SQL) is the most commonly used query language for
relational databases. Users can specify criteria, such as conditions and filters, to fetch specific records from the database. Database querying is essential for making informed decisions and building data-driven businesses.
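The sketch below uses Python's built-in sqlite3 module to show the querying idea end to end; the table and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()

# Create a small table and insert a few sample rows.
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 120.0), ("South", 80.5), ("North", 64.0)])

# SQL query with a condition (WHERE) and an aggregation (GROUP BY).
cur.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 50
    GROUP BY region
    ORDER BY total DESC
""")
print(cur.fetchall())   # e.g. [('North', 184.0), ('South', 80.5)]
conn.close()
```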
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=E7oACf4a24Y&pp=ygUaZGF0YSBleHRyYWN0aW9uIGluIGVuZ2xpc2g%3D
Data Integration Architecture Types
2. Real-time Integration Architecture
Real-time integration architecture enables the continuous flow of data in near-real-
time or real-time from source systems to target systems. It involves capturing data
changes as they occur and immediately propagating them to the target systems.
This architecture is commonly used in scenarios where immediate data availability
and responsiveness are critical, such as online transaction processing (OLTP)
systems or real-time analytics.
4. Extract, Load, Transform (ELT) Architecture
ELT involves extracting data from source systems, loading it into a target system or
data lake, and then performing transformations directly within the target system. ELT
leverages the processing power and scalability of the target system to perform
complex transformations, as opposed to traditional extract, transform, load (ETL)
architectures where transformations happen before loading the data. ELT
architecture is well-suited for scenarios with large volumes of data and where the
target system can handle the transformation processes efficiently.
Data Reduction:
1. Data Cube Aggregation:
A data cube is constructed using aggregation operations on the data.
3. Numerosity Reduction:
In this case, data preprocessing stores only a model (or a reduced representation) of the data and discards the unnecessary raw data.
4. Dimensionality Reduction:
Using various encoding mechanisms, the size of the data can be reduced. Depending on how the reduction is done, data may or may not be lost. If the original data can be successfully reconstructed from the reduced data, the reduction is considered lossless; otherwise, the discarded detail is lost for good.
Data Transformation
Data transformation plays a crucial role in data management. This process reshapes
data into formats that are more conducive to analysis, unlocking its potential to
inform and guide strategic decision-making. It encompasses a spectrum of
techniques such as cleaning, aggregating, restructuring, and enriching, each
designed to refine data into a more usable and valuable asset.
As organizations increasingly rely on data-driven strategies for growth and efficiency,
understanding and mastering data transformation becomes essential. This guide
delves into the intricacies of data transformation, exploring its role, methodologies,
and impact on the overall data integration process.
Tools and Technologies for Data Transformation
Data transformation tools are diverse, each designed to address specific aspects of
data transformation. These tools can be broadly categorised as follows:
1.ETL (Extract, Transform, Load) Tools: These tools are fundamental in data
warehousing. They extract data from various sources, transform it into a suitable
format or structure, and then load it into a target system or database.
2.Data Cleaning Tools: Focused on improving data quality, these tools help in
identifying and correcting errors and inconsistencies in data.
Video Link:
https://www.youtube.com/watch?v=Kq4QgbhkqyE&pp=ygUuZGF0YSBpbnRlZ3JhdGlvbiBpbiBkYXRhIHZpc3VhbGlzYXRpb24gZW5nbGlzaA%3D%3D
Data visualization helps us make sense of data by simplifying it and presenting it
in a clear way that is easy to understand. It can represent anything from complex
economic trends to simple relationships between different variables.
Data visualization can be done on paper or computer screens using software
programs such as Excel or Tableau; however, sometimes, it’s done using physical
objects such as maps or models. It helps us see patterns and relationships
between different pieces of information, which can help us make better and more
efficient decisions about our lives and businesses — whether looking at real
estate prices or trying to find out what your customers want most from your
company.
Importance of Data Visualizations
Video Link:
https://www.youtube.com/watch?v=MiiANxRHSv4&pp=ygU2cm9sZSBvZiB2aXN1YWxpemF0aW9uIGluIGRhdGEgdmlzdWFsaXphdGlvbiBpbiBlbmdsaXNo
Scatter Plots:
Scatter plots are used to display the relationship between two continuous
variables. Each data point is represented by a dot on the plot, and the position of
the dot corresponds to the values of the variables being compared. Scatter plots
are ideal for identifying trends, clusters, and outliers in the data.
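A minimal matplotlib sketch of a scatter plot; the data is randomly generated purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)   # a noisy linear relationship

plt.scatter(x, y, alpha=0.7)                    # one dot per observation
plt.xlabel("Variable X")
plt.ylabel("Variable Y")
plt.title("Scatter plot of two continuous variables")
plt.show()
```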
Line Charts:
Line charts are commonly used to show trends and changes over time. They
connect data points with lines, making it easy to visualize the progression of a
variable. Line charts are effective for illustrating patterns, seasonality, and
comparing multiple time series.
Bar Charts:
Bar charts are widely used to compare categories or discrete variables. The length
of each bar represents the value of the variable being displayed, and the bars can
be arranged vertically or horizontally. Bar charts are ideal for visualizing
frequency, distribution, and comparisons between different groups or categories.
Histograms:
Histograms provide a visual representation of the distribution of a continuous
variable. They group data into bins or intervals, and the height of each bar
represents the frequency or count of observations within that bin. Histograms are
useful for understanding the shape, central tendency, and spread of data.
Pie Charts:
Pie charts represent the proportion or percentage of different categories within a
whole. They are circular in shape, with each category represented by a slice of the
pie. Pie charts are suitable for showcasing the contribution of each category to
the whole and making comparisons between different parts.
Box Plots:
Box plots, also known as box-and-whisker plots, provide a visual summary of the
distribution of a continuous variable. They display the minimum, maximum,
median, and quartiles of the data. Box plots are useful for identifying outliers,
comparing distributions, and understanding the spread and skewness of the data.
Heatmaps:
Heatmaps use color gradients to represent the magnitude or intensity of values in
a matrix or two-dimensional dataset. They are particularly effective for visualising
large datasets and identifying patterns or clusters. Heatmaps are commonly used
in areas such as genomics, finance, and data analysis.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=xVWKPSIBDIQ&pp=ygUyY2hhcnRzIGFuZCBwbG90cyBpbiB2aXN1YWxpemF0aW9uIGluIGVuZ2xpc2ggbnB0ZWw%3D
Importance of Data Cleaning
Clean data is fundamental for robust analysis and accurate modelling in data
science. If data is not adequately cleaned and prepared before analysis, several
significant problems can arise, impacting the reliability and accuracy of any
subsequent analysis or decision-making processes. Data that hasn’t been cleaned
appropriately can lead to questionable results. This can reduce the confidence
that stakeholders have in the analysis and its implications, potentially causing
delays in decision-making. Here are some reasons highlighting the importance of
data cleaning:
3. Effective Decision-Making:
Analyzing and presenting unclean data can reduce the credibility of the analysis
and the individuals or organizations responsible for it. Stakeholders may lose trust
in the accuracy and reliability of the findings. Decision-making is supported by
trustworthy data that is free of biases and inaccuracies. Managers and
stakeholders can confidently rely on the results of analyses based on clean data.
4. Avoidance of biased insights:
Inaccuracies in data can introduce biases that skew analysis and conclusions.
Data containing biases, such as imbalances, duplicate records, or incomplete
entries, can result in biased conclusions that may favor a particular group or
perspective. This can be detrimental in various domains, including business,
healthcare, finance, and the social sciences. Data cleaning helps mitigate these
biases and ensures that the insights drawn are representative and unbiased.
8. Cost Savings:
Data cleaning reduces unnecessary costs associated with erroneous data, such as
marketing to incorrect addresses or targeting the wrong audience. Analysing
unclean data can lead to wasted resources and time spent on investigating false
leads, correcting errors, and redoing analyses. In terms of both time and money,
this can be expensive. Correcting errors early in the data lifecycle is typically
more cost-effective than addressing issues later, especially after analysis or when
errors have propagated throughout the organization.
Data Cleaning Techniques
2. Outlier Detection and Treatment
Data points that deviate significantly from other observations are known as
outliers. Analysis and models can be distorted by outliers. Techniques like the Z-
score method or the IQR (interquartile range) method help identify and handle
outliers appropriately, ensuring they don’t skew the analysis. Managing outliers is
crucial for robust analysis.
· Visual and Statistical Inspection: Calculate summary statistics (e.g., mean, median, standard deviation, quartiles) to understand the central tendency and spread of the data; unusually large or small values may indicate outliers. Use histograms, box plots, or scatter plots to visualize and spot outliers.
· Statistical Methods: Employ statistical techniques like the Z-score or the IQR (interquartile range) to detect and handle outliers, either by removal or transformation. Calculate the Z-score for each data point; points with a Z-score beyond a threshold (e.g., 3) are considered outliers. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), and outliers fall outside the range defined by Q1 − 1.5 * IQR and Q3 + 1.5 * IQR (see the sketch below).
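The following sketch applies both rules to a small made-up series: the Z-score flags points far from the mean, and the IQR rule flags points outside Q1 − 1.5·IQR to Q3 + 1.5·IQR.

```python
import numpy as np

values = np.array([12, 11, 13, 12, 10, 11, 12, 13, 11, 12, 10, 13, 12, 11, 95])  # 95 is an obvious outlier

# Z-score rule: points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)   # both rules flag 95
```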
3. Data Transformation and Standardization
Data transformation includes converting data into a format that is acceptable for
analysis. This may include scaling features, encoding categorical variables, or
creating new features that are more informative for the intended analysis.
Standardizing data ensures uniformity and consistency, simplifying subsequent
analysis.
· Scaling: Scale numerical features to a specific range (e.g., [0, 1] or [-1, 1]) to
provide equal importance to all features.
· Normalization and Standardization: Normalization scales the features to the range between 0 and 1, while standardization transforms the data to have a mean of 0 and a standard deviation of 1, making features easier to compare and analyze (a small sketch follows).
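A minimal scikit-learn sketch of both operations on a single made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One numerical feature with very different magnitudes (hypothetical incomes).
X = np.array([[20_000.0], [35_000.0], [50_000.0], [120_000.0]])

# Normalization: rescale to the [0, 1] range.
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: rescale to mean 0 and standard deviation 1.
print(StandardScaler().fit_transform(X).ravel())
```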
4. Removing Duplicates
Identify and drop duplicate records so that repeated entries do not inflate counts or bias the analysis.
5. Text Cleaning
For textual data, several preprocessing steps prepare the text for analysis (a small sketch follows this list):
· Tokenization: Break the text into smaller units, such as words or sentences.
Tokenization aids in text organization for additional processing.
· Removing special characters: Eliminate special characters, symbols, and non-
alphanumeric characters that don’t contribute to the meaning of the text.
· Removing Stop Words: Stop words are commonly occurring words in a language
that do not carry significant meaning. Eliminate common and non-informative
words (e.g., “and,” "the," and “is”) that do not provide meaningful insights for
analysis. Removing stop words helps reduce the dimensionality of the text data
and focus on the more informative terms.
· Lemmatization and Stemming: Reduce words to their root or base form
(stemming) or transform them to their dictionary form (lemmatization). This helps
standardize variations of words.
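The sketch below applies these steps with plain Python plus NLTK's Porter stemmer; the stop-word set is a tiny illustrative subset rather than a full list.

```python
import re
from nltk.stem import PorterStemmer   # the stemmer ships with NLTK, no extra download needed

text = "The invoices were processed and the payments are being reconciled!"

# Tokenization: split the text into lowercase word tokens (special characters are dropped).
tokens = re.findall(r"[a-z]+", text.lower())

# Remove stop words: common words that carry little meaning for the analysis.
stop_words = {"the", "and", "are", "is", "were", "being"}
tokens = [t for t in tokens if t not in stop_words]

# Stemming: reduce each word to its root form.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# e.g. ['invoic', 'process', 'payment', 'reconcil']
```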
6. Handling Inconsistent Data
Inconsistencies in data formatting or values can hinder analysis. To make sure the
data is accurate, dependable, and appropriate for analysis or other uses, it is
necessary to handle inconsistent data. Data input errors, differing information
formats, divergent standards, or the integration of data from numerous sources
are just a few causes of inconsistent data. Here are steps to handle inconsistent
data:
· Identify Inconsistencies: Begin by thoroughly examining the dataset to identify
inconsistencies, including discrepancies, errors, or anomalies in the data.
· Data Formatting: Standardize date formats, capitalization, and other conventions
to maintain uniformity. Ensure that data follows consistent formats for dates,
times, currency, units, and other relevant formats. Convert inconsistent data to
the desired format.
· Data Validation Rules: Define and apply validation rules so that the data adheres to predefined patterns or constraints, such as a range of valid values or required fields.
7. Feature Engineering and Selection
· Feature Selection: Choose the most relevant features to reduce noise and improve model performance. Compute correlations between each feature and the target variable to find the most influential features, or use statistical tests and information gain to rank features by their relevance to the target variable. Utilize methods like principal component analysis (PCA) or singular value decomposition (SVD) to reduce the dimensionality of the feature space while preserving the most important information.
8. Addressing Class Imbalance (for Classification Tasks)
Class imbalance in a classification task occurs when the number of instances in
each class significantly differs, potentially leading to biased models that favor the
majority class. Addressing this imbalance is crucial to creating models that can
make accurate predictions for all classes. For datasets that are unbalanced, with
one class greatly outnumbering another:
· Oversampling: Duplication or synthetic generation can be used to increase the
proportion of occurrences in the minority class. To balance the distribution of
classes, duplicate instances from the minority class at random. Create artificial
data points by using methods such as SMOTE (Synthetic Minority Over-sampling Technique); a minimal sketch of random oversampling and class weighting follows this list.
· Undersampling: To establish balance, lower the number of instances in the
majority class. To restore balance to the class distribution, arbitrarily eliminate
instances from the dominant class. Be cautious not to remove too much
information, which can lead to underrepresentation of the majority class.
· Class Weighting: Assign higher weights to the minority class during model
training to give it more importance. Many classification algorithms allow for class
weights to be specified, which can help mitigate the impact of class imbalance.
· Ensemble Techniques: Use ensemble methods like bagging (e.g., Random
Forest) or boosting (e.g., AdaBoost, XGBoost) that can handle class imbalance by
combining predictions from multiple weak learners.
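As a minimal sketch (one of several possible approaches), the code below shows naive random oversampling with pandas and, alternatively, weighting the minority class via scikit-learn's class_weight option; the data is invented.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced dataset: 6 "normal" rows (label 0) vs. 2 "fraud" rows (label 1).
df = pd.DataFrame({
    "amount": [10, 12, 9, 11, 10, 13, 500, 480],
    "label":  [0, 0, 0, 0, 0, 0, 1, 1],
})

# Random oversampling: duplicate minority rows until the classes are balanced.
minority = df[df.label == 1]
majority = df[df.label == 0]
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)
print(oversampled.label.value_counts())

# Alternative: keep the data as-is and give the minority class a higher weight.
clf = LogisticRegression(class_weight="balanced").fit(
    df[["amount"]].values, df["label"].values
)
print(clf.predict([[490]]))   # likely predicted as the minority class
```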
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=mwEPXevpqls&pp=ygUvZGF0YSBjbGVhbmluZyBpbiB2aXN1YWxpemF0aW9uIGluIGVuZ2xpc2ggbnB0ZWw%3D
Numerical-Numerical Relationships
When dealing with numerical data, several visualization techniques help to
uncover relationships between variables.
i) Scatter Plot: Scatter plots are useful for visualizing the relationship between
two numerical variables. They can be enhanced using additional parameters such
as hue, size, and style to represent other dimensions.
ii) Pair Plot: Pair plots provide a matrix of scatter plots for each pair of numerical
variables, allowing for a comprehensive view of their relationships. Including
a hue parameter can add a categorical dimension to the visualization.
iii) Line Plot: Line plots are particularly useful for time-series data, showing
trends over time. By grouping data by time intervals, such as years, and
summarizing it, we can observe patterns and changes.
Categorical-Categorical Relationships
Understanding relationships between categorical variables can be achieved
through several techniques:
i) Crosstab: A crosstabulation (or crosstab) table displays the frequency
distribution of variables, providing a straightforward way to observe the
interaction between categorical variables.
ii) Heatmap: Heatmaps visually represent crosstab data, using color to indicate the magnitude of values. This method highlights patterns and correlations effectively (see the sketch after this list).
iii) Cluster Map: Cluster maps extend heatmaps by applying clustering
algorithms to the rows and columns, revealing deeper structures and groupings in
the data.
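A minimal pandas/seaborn sketch of a crosstab rendered as a heatmap; the categorical data is invented.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT", "HR", "Sales", "HR", "IT"],
    "status":     ["Full-time", "Part-time", "Full-time", "Full-time",
                   "Part-time", "Full-time", "Full-time", "Part-time"],
})

# Crosstab: frequency of each department/status combination.
table = pd.crosstab(df["department"], df["status"])
print(table)

# Heatmap: colour encodes the counts, making the pattern easy to scan.
sns.heatmap(table, annot=True, fmt="d", cmap="Blues")
plt.show()
```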
Numerical-Categorical Relationships
When analysing the relationship between numerical and categorical variables, the
following visualization techniques are particularly useful:
i) Bar Plot (with Confidence Intervals): Bar plots summarize the central
tendency of a numerical variable for each category of a categorical variable, often
including confidence intervals to indicate variability.
ii) Box Plot: Box plots display the distribution of a numerical variable across
different categories, highlighting the median, quartiles, and outliers.
iii) Dist plot: Dist plots (or distribution plots) are useful for comparing the
distributions of a numerical variable across different categories. They combine
histograms and kernel density plots to provide a comprehensive view.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=AmNqUu_e4nQ
1. Recursive Pattern Technique:
Data values are displayed in a grid using recursive subdivision. The aim is to
group related data hierarchically, revealing patterns within sub-regions.
o The screen space is recursively divided into rectangular sub-regions.
o Each sub-region corresponds to a subset of data, organized hierarchically.
o Pixels within a sub-region are colored based on their data value.
Example
o Visualizing sales data by geographical region:
Top-level divisions represent countries.
Sub-regions represent states, cities, and stores.
o Patterns in sales performance can be identified across different levels.
2. Pixel Bar Charts
Combines traditional bar charts with pixel-based data representation. Each bar is
made up of pixels, where each pixel represents a data point.
o Bars are drawn for each category or time period.
o Each bar is filled with a sequence of pixels.
o The color of each pixel represents its data value.
•Example:
o Visualizing daily sales over months:
Each bar represents a month.
Pixels in the bar represent daily sales, with colours showing sales
volume.
3. Circle Segment Technique
A circular visualization where data is displayed in concentric segments. Each
segment represents a category, and pixels within segments encode data values.
o The circle is divided into segments like a pie chart.
o Each segment contains rows of pixels.
o The pixel colour represents the data value.
•Example:
o Visualizing air quality data:
Segments represent different cities.
Rows within each segment represent hourly readings.
4. Temporal Pixel Maps
Designed for time-series data, where pixels are arranged in chronological order to
reveal trends over time.
o Data values are arranged line-by-line (row-wise) or column-by-column.
o Pixels are coloured based on their data values.
o The chronological arrangement highlights trends and anomalies.
•Example:
o Visualizing temperature over a year:
Rows represent days, and columns represent hours.
Colour intensity indicates temperature.
Geometric Projection Visualization Techniques
Geometric projection techniques are methods for visualizing high-dimensional
data by projecting it onto lower-dimensional spaces (e.g., 2D or 3D). These
techniques help users identify patterns, clusters, and relationships in complex
datasets while preserving as much of the original structure as possible.
Types:
1. Scatterplots
o A scatterplot represents data points in a 2D or 3D space using Cartesian
coordinates.
o Each axis corresponds to a feature (variable) in the dataset.
o Scatterplots are ideal for datasets with up to three dimensions.
Example:
o Visualizing customer segmentation:
X-axis: Age, Y-axis: Income, Colour: Spending behaviour.
o This helps in identifying clusters of customers.
2. Principal Component Analysis (PCA)
o PCA is a linear dimensionality reduction technique that transforms high-
dimensional data into a lower-dimensional space while retaining the most
variance in the data.
Example:
o Genome Data Analysis:
PCA can reduce thousands of gene expression dimensions into 2-3 components, making visualization easier (a minimal sketch follows).
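A minimal PCA sketch with scikit-learn; the four-dimensional random data stands in for a much higher-dimensional dataset such as gene expression measurements.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                      # 100 samples, 4 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)     # make one feature largely redundant

# Project onto the 2 directions that retain the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                       # (100, 2) -> ready for a 2-D scatter plot
print(pca.explained_variance_ratio_)    # share of variance kept by each component
```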
1. Star Glyphs
Star glyphs are radial visualizations where each data attribute is represented as
an axis radiating from a central point. The value of the attribute determines the
length of the axis.
o The number of axes corresponds to the number of attributes.
o Connecting the endpoints of the axes forms a polygon (the "star").
o The shape and size of the star provide a visual summary of the data.
Example:
•Student Performance Analysis:
o Attributes: Test scores in subjects like math, science, and English.
o Visualization: Each student gets a star glyph, showing strengths and
weaknesses in different subjects.
2. Chernoff Faces
Chernoff faces represent data attributes using features of a human face, such as
eyes, mouth, and nose. Each feature corresponds to an attribute, with variations
encoding data values.
o Faces are highly recognizable to humans, enabling quick pattern detection.
o Changes in facial features are intuitive and noticeable.
Example:
•Customer Satisfaction Analysis:
o Attributes: Likelihood of recommending, overall satisfaction, and frequency
of use.
o Visualization: Customers with high satisfaction might have larger eyes and
smiling mouths, while dissatisfied customers may have smaller eyes and
frowning mouths.
3. Stick Figures
Stick figures are minimalistic representations of data using simple "stick" icons.
Each limb or body part corresponds to a data attribute, and its orientation, length,
or angle represents the attribute's value.
o Stick figures are compact and can encode multiple attributes
simultaneously.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=f79bJTZSAqc
1. Tree Diagrams
•Vertical Layout:
o Commonly used in organizational charts.
o The root is at the top, and branches flow downward.
•Horizontal Layout:
o Suitable for decision trees or processes.
o The root node is on the left, and branches flow horizontally.
•Radial Layout:
o Nodes radiate outward from a central root node.
o Often used in biological taxonomies.
Tree diagrams are ideal for scenarios where the relationships between nodes are
of primary importance, such as representing family trees or explaining logical
processes like decision-making.
2. Treemaps
Treemaps use nested rectangles to represent hierarchical data. The entire
visualization space is divided into rectangles, and each rectangle represents a
node in the hierarchy. The size of each rectangle corresponds to a quantitative
attribute of the node, such as its value or importance.
•Nested Structure:
o Parent nodes are larger rectangles containing smaller rectangles for their
child nodes.
•Color Coding:
o Different colors are often used to represent additional attributes like
categories or performance metrics.
•Proportional Areas:
o The area of each rectangle directly reflects the value of the node, making it
easy to compare sizes.
Treemaps are widely used in financial analysis, where they can show portfolio
performance or sales data by dividing sectors into sub-sectors.
3. Circle Packing
Circle packing visualizes hierarchical data using nested circles. The root node is
the largest circle, and it contains smaller circles representing child nodes. This
method emphasizes containment and the relative sizes of nodes.
•Parent-Child Containment:
o Parent nodes are visually represented by the largest circles, with their child
nodes nested inside them.
•Proportional Sizes:
o The size of each circle corresponds to the importance or value of the node.
•Compact Representation:
o Circle packing makes efficient use of space while providing an aesthetically
pleasing representation.
This technique is often used in environmental studies to represent ecosystems,
where larger circles might represent broader ecological categories, and smaller
circles represent species.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=GFJF1s6hL6s&pp=ygUrSGllcmFyY2hpY2FsIFZpc3VhbGl6YXRpb24gaW4gZW5nbGlzaCBucHRlbA%3D%3D
Colour harmony: This refers to the way that colours are arranged together to
create a pleasing effect. There are a number of different colour harmonies that
can be used in visualization, such as complementary colours, analogous colours,
and triadic colours.
Colour contrast: This refers to the difference between two or more colours. Colour
contrast can be used to create visual interest and to highlight important data
points.
Colour symbolism: This refers to the way that colours are associated with different
meanings. For example, red is often associated with passion, blue is often
associated with calmness, and green is often associated with nature.
Colour theory can be a complex subject, but it is an important one for anyone
who wants to create effective visualizations. By understanding the principles of
colour theory, you can use colour to communicate your message more effectively
and to create visualizations that are more visually appealing.
Applications of Colour Theory
Data Types and Visual Variables
1. Data Types
Data types in visualization determine how data is represented and analysed. They guide the selection
of appropriate charts, graphs, and visual encodings.
Types of Data
3.Ordinal Data:
o Represents ordered categories.
o Examples: Survey rankings (e.g., poor, average, excellent).
o Visualizations: Ordered bar charts, stacked bar charts.
4. Nominal Data:
o Represents non-ordered categories.
o Examples: Colors, brands, countries.
o Visualizations: Pie charts, categorical scatter plots.
5. Time-Series Data:
o Represents data points over time.
o Examples: Stock prices, temperature over days.
o Visualizations: Line charts, area charts.
6. Geospatial Data:
o Represents data tied to geographic locations.
o Examples: City population, rainfall in regions.
o Visualizations: Maps, choropleth maps.
7. Multivariate Data:
o Represents multiple variables simultaneously.
o Examples: Age, income, and education levels of individuals.
o Visualizations: Bubble charts, parallel coordinate plots
2. Visual Variables
Visual variables are the building blocks of data visualization. They are the attributes of graphical
elements that encode data and make patterns, relationships, or trends observable.
Types of Visual Variables
1. Position:
o Placement of data points on axes.
o Example: A scatter plot uses position to represent two variables on the x
and y axes.
2. Size:
o Represents magnitude through the size of an object.
o Example: Bubble charts use the size of bubbles to indicate a quantitative
variable.
3. Shape:
o Distinguishes between categories using different shapes.
o Example: A scatter plot uses circles, squares, or triangles for different
groups.
4. Color (Hue):
o Represents categories or intensity.
o Example: A heatmap uses gradients to show intensity; different hues distinguish separate categories.
5. Brightness (Value):
o Represents magnitude by varying lightness or darkness.
o Example: A choropleth map uses darker shades for higher values.
6. Texture/Pattern:
o Differentiates areas or categories using patterns.
o Example: Bar charts use striped or dotted patterns to show subcategories.
7. Orientation:
o Uses angles or directions to encode data.
o Example: Arrows in vector field visualizations show wind direction.
8. Length:
o Represents quantitative differences by varying line lengths.
o Example: Bar charts use the length of bars to show values.
9. Width:
o Uses thickness to encode quantitative values.
o Example: Line graphs with varying thickness for different variables.
10. Enclosure:
o Groups related data points by enclosing them in shapes.
o Example: Venn diagrams.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=YeI6Wqn4I78&pp=ygUXQ29sb3IgVGhlb3J5IGluIGVuZ2xpc2g%3D
2. Stacked Bar Chart
A stacked bar chart is a bar graph with multiple data series stacked end-to-end, with the far-right end of the bar representing the total of all the components in the bar. The x-axis represents quantitative data, while the y-axis represents categorical data. Stacked bar graphs are used to show how a whole is divided into its various parts and the relative effect each part has on the total, i.e., a part-to-whole relationship. Each data series takes a different shade or color, explained using a legend.
3. 100% Stacked Bar Chart
This variation of the stacked bar chart plots the percent of the values instead of
the actual values. The total of each stacked bar always equals 100%.
4. Column Charts (Vertical Bar graphs)
Column charts or column graphs are bar charts with vertical bars. In column charts, the categories are placed along the x-axis, and the height of each bar along the y-axis denotes its value.
5. Stacked Column Charts
Like stacked bar graphs, stacked column charts have each bar representing the
whole and each segment denotes the various parts of the whole. The y-axis
represents quantitative data while the x-axis categorical data.
6. 100% Stacked Column Graph
This variation of the stacked column chart plots the percent of the values instead of the actual values. The total of each stacked column always equals 100%.
7. Grouped Bar Charts and Grouped Column Charts
A grouped bar/column chart (clustered bar/column graph) is another variation of
the bar/column chart that compares different categories of two or more groups.
The categories are grouped and arranged side-by-side making interpretation easy
inside the groups and even between the same categories. They are useful in
making comparisons across different categories of data.
Maps in Data Visualization:
Map visualization is used to analyse and display geographically related data in the form of maps. This kind of expression is clearer and more intuitive: we can visually see the distribution or proportion of data in each region, which makes it easier to mine deeper information and make better decisions.
1. Point Map
Point maps are straightforward and especially suited to data with a wide geographic distribution. For example, a company with many sites may want to view data for each specific location in a given area; this is hard to do accurately with a general-purpose map, whereas a point map provides precise and fast positioning.
2. Line Map
Line maps are used less often because they are relatively difficult to draw. However, a line map can encode not only space but also time, which makes it particularly valuable for analysing certain scenarios.
3. Regional Map
A regional map is also called a filled map. It can be displayed by country, province, city, district, or even custom regions, and the data size of each region can be judged from the shade of its colour.
4. Flow Map
Flow maps are often used to visualize origin-destination flow data. The origin and
destination can be points or surfaces. The interaction data between the origin and
the destination is usually expressed by a line that connects the geometric centres of gravity of the spatial units. The width or colour of the line indicates the magnitude of the flow between the origin and the destination. Each spatial location can be either an origin or a destination.
5. Heatmap
The heatmap is used to indicate the weight of each point in a geographical range, usually displayed with highlighted colours. A haze (air-quality) map is a typical example: the darker the colour of an area, the worse its air quality.
Trees in Data Visualization
In data visualization, trees represent hierarchical or nested data structures. They
use nodes and edges to display relationships between items, typically illustrating
parent-child or ancestor-descendant relationships. Tree-based visualizations make
complex data easier to explore and interpret.
Networks in Data Visualization
Components of a Network
1.Nodes:
o Represent entities, objects, or data points.
o Example: People in a social network, servers in a computer network.
2.Edges:
o Represent relationships, interactions, or connections between nodes.
o Can be directed (with an arrow indicating direction) or undirected (no
direction).
3.Weights:
o Represent the strength or magnitude of a connection.
o Example: Friendship strength in social networks, distance in transportation
networks.
4.Clusters (Communities):
o Groups of nodes with denser connections within the group compared to
outside.
5.Attributes:
o Additional information attached to nodes or edges, such as labels, categories, or metrics.
Types of Network Visualizations
1.Node-Link Diagrams:
o Use points for nodes and lines for edges.
o Example: Social networks, communication networks.
2.Adjacency Matrices:
o Use a grid format to represent connections between nodes.
o Example: Flight route maps, co-occurrence matrices.
3.Force-Directed Layouts:
o Arrange nodes based on attractive and repulsive forces to avoid overlap
and highlight relationships.
o Example: Relationship maps, knowledge graphs.
4.Hierarchical Networks:
o Represent data with a clear hierarchy or flow.
o Example: Organization charts, family trees.
5.Geospatial Networks:
o Overlay network connections on maps.
o Example: Road networks, internet infrastructure.
6.Dynamic Networks:
o Represent networks that evolve over time.
o Example: Social media interaction networks over days or weeks.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=xVWKPSIBDIQ&pp=ygUyY2hhcnRzIGFuZCBwbG90cyBpbiB2aXN1YWxpemF0aW9uIGluIGVuZ2xpc2ggbnB0ZWw%3D
1. Manual Data Collection
Manual data collection involves human intervention to gather information, often
through direct interaction with subjects or environments. This method is typically
used when qualitative insights or specific information from individuals are
required.
•Interviews:
o Data is gathered through one-on-one or group interactions where
participants provide in-depth responses to questions.
o Advantages: Provides rich, detailed data; enables clarification and follow-
up questions; useful for sensitive topics.
o Disadvantages: Time-consuming, interviewer bias, and limited scalability
(i.e., difficult to reach a large sample size).
•Observations:
o Collecting data by observing subjects in a natural or controlled
environment. This method can be either participatory (where the observer
is involved in the activities) or non-participatory (where the observer is a
passive observer).
o Advantages: Useful for collecting behavioral data, often provides real-time
insights.
o Disadvantages: Subject to observer bias, time-consuming, and can lack
generalizability.
•Fieldwork:
o This method involves collecting data in natural settings, often used in social
sciences, anthropology, and environmental studies.
o Advantages: High ecological validity (real-world context), flexible and
adaptable.
o Disadvantages: Resource-intensive, time-consuming, and researcher bias.
2. Automated Data Collection
Automated data collection involves the use of technology (sensors, devices, or
software) to gather data with minimal human intervention. This method is highly
efficient for large-scale, real-time, and quantitative data collection.
Methods of Automated Data Collection:
•Web Scraping:
o Web scraping uses automated scripts or tools to extract data from
websites. It can be used to collect data like product prices, social media
mentions, or news articles.
o Advantages: Large-scale data collection, cost-effective compared to
manual methods.
o Disadvantages: Legal and ethical issues (e.g., violating terms of service),
data may be unstructured or inconsistent.
•Application Programming Interfaces (APIs):
o APIs allow software systems to request data from other systems. For
example, social media platforms like Twitter or Google provide APIs to
access their data programmatically.
o Advantages: Easy to integrate into systems, real-time data, can be
automated.
o Disadvantages: Limited by the API’s rate limits, may require technical
expertise, data availability depends on third-party services.
•Data Logging:
o Automated systems can log data over time, often used for scientific
experiments, industrial systems, or machine monitoring.
o Advantages: Continuous, real-time data collection, accurate and
consistent.
o Disadvantages: Data can be voluminous, requiring significant storage and
processing resources.
3. Remote Sensing
Remote sensing refers to the acquisition of data about an object or area from a
distance, typically using satellites or aerial sensors. This method is primarily used
for geospatial data collection in fields like meteorology, agriculture, and
environmental monitoring.
Methods of Remote Sensing:
•Satellite Imaging:
o Satellites equipped with cameras or sensors capture high-resolution images
and data of the Earth’s surface. This includes visual imagery, infrared,
radar, and other spectral data.
o Advantages: Covers large areas, real-time or near-real-time data, non-
invasive, useful for environmental and geospatial studies.
o Disadvantages: High cost, limited resolution for some sensors, and
affected by weather conditions.
•Aerial Surveys (Drones):
o Drones or UAVs (Unmanned Aerial Vehicles) can capture high-resolution
images, videos, and other data types from the air. Drones are often used
for mapping, monitoring crop health, and inspecting infrastructure.
o Advantages: High resolution, flexible, can be deployed quickly, relatively
low cost.
o Disadvantages: Limited by flight time and range, may require regulatory
approval, can be impacted by weather.
•LiDAR (Light Detection and Ranging):
o A remote sensing technology that uses laser pulses to measure distances
to the Earth’s surface, creating precise 3D maps of landscapes.
o Advantages: Highly accurate, can penetrate vegetation, and create
detailed topographical models.
o Disadvantages: Expensive equipment, requires skilled operators, and
large amounts of data processing.
•Hyperspectral Imaging:
o This technique captures a broad range of spectral bands beyond visible
light to detect materials, environmental conditions, and more.
o Advantages: Detects materials and conditions invisible to the human eye,
detailed data.
o Disadvantages: Complex data analysis, expensive, and large data storage
requirements.
4. Transactional Data Collection
Transactional data is generated through exchanges between systems or users.
This data typically includes records of actions, transactions, or communications
and is often stored in digital databases.
Methods of Transactional Data Collection:
•Online Transactions:
o Data gathered from digital purchases or activities on websites and apps,
such as e-commerce platforms, banking, or gaming.
o Advantages: Real-time data collection, large data volumes, easy to
analyse.
o Disadvantages: Privacy issues, data security concerns.
•Sensor Data:
o Devices or sensors that track actions, such as RFID tags in logistics or sensors in smart devices, gather transactional data based on movements and interactions.
5. Data Mining
Data mining refers to the process of discovering patterns, relationships, or trends
in large datasets, often using machine learning or statistical techniques. This
method is widely used for uncovering hidden insights from transactional or large-
scale data.
Methods of Data Mining:
•Clustering:
o Grouping similar data points together. For example, clustering customer
data based on purchasing behavior.
o Advantages: Helps identify natural groupings in data, such as customer
segments.
o Disadvantages: Results depend on the algorithm used, may not always
find meaningful groups.
•Classification:
o Categorizing data points into predefined classes. For example, classifying
emails as spam or not spam.
o Advantages: Useful for predictive modeling, easy to interpret.
o Disadvantages: Requires labeled data, sensitive to data quality.
•Association Rule Mining:
o Discovering interesting relations or patterns between variables, such as
items frequently bought together in a supermarket.
o Advantages: Uncovers hidden relationships, useful for marketing and
recommendation systems.
o Disadvantages: May produce irrelevant or trivial rules.
•Anomaly Detection:
o Identifying outliers or unusual patterns in data, often used for fraud
detection or quality control.
o Advantages: Effective for finding fraud or errors.
o Disadvantages: May generate false positives or miss subtle anomalies.
Classification of Information Sources
Information sources refer to the origins from which data is gathered.
Understanding the classification of information sources is critical for selecting the
appropriate data for research, analysis, or decision-making. Information sources
are typically classified based on their origin, whether the data is collected
firsthand or second-hand, the format of the data, or the method of access. Below
is a detailed explanation of the classification of information sources:
Examples of Primary Data Sources:
•Surveys and Questionnaires: Data is collected through structured questions
either in person, via phone, or online. These can include closed-ended
(quantitative) or open-ended (qualitative) questions.
o Example: Market research survey for consumer preferences.
•Interviews: Data collected through one-on-one or group interactions where
participants respond to open or structured questions.
o Example: Expert interviews for academic research on healthcare systems.
•Experiments: Data gathered from controlled experiments where researchers
manipulate variables and measure outcomes. This is common in scientific,
medical, and social research.
o Example: Clinical trial data on the efficacy of a new drug.
•Observations: Data collected through direct observation of subjects or
phenomena. This is often used in natural sciences or social sciences where human
behaviour is studied.
o Example: Observing customer behaviour in a retail store.
•Case Studies: Data obtained from in-depth analysis of a particular individual,
group, event, or community.
o Example: A study on the impact of a new educational policy in a school.
•Fieldwork: Data collected in natural settings outside of a laboratory or
controlled environment. Common in anthropology, sociology, and environmental
sciences.
o Example: Anthropological research in remote villages or ecosystems.
2. Secondary Data Sources
Secondary data is data that has been collected by someone else for a purpose
other than the current research or analysis. This type of data is often used when
primary data is not available, too costly, or unnecessary for the specific analysis.
Examples of Secondary Data Sources:
•Government Reports and Statistics: Data published by government agencies
on various sectors like economics, health, education, and more.
o Example: U.S. Census data or national health surveys.
•Academic Research Papers: Published studies and academic articles that
provide insights, data, and findings on a particular topic.
o Example: Research articles in medical journals or social science papers.
•Industry Reports: Data collected and published by research firms or industry
experts.
o Example: Annual market research reports from firms like Nielsen or
McKinsey.
•Public Databases: Data repositories maintained by organizations, universities,
or governments that provide access to research datasets or public information.
o Example: World Bank Data, Google Scholar, or clinical trial databases.
•Books and Textbooks: Published books that may contain valuable data, case
studies, or historical data relevant to a field of study.
o Example: Textbooks with aggregated educational data, research
summaries, or historical data analyses.
•Media Sources: News articles, reports, and online publications that provide
secondary data on current events, trends, and societal changes.
o Example: News reports or magazine articles analyzing the impacts of
climate change.
3. Tertiary Data Sources
Tertiary sources are those that summarize, index, or compile data from primary
and secondary sources. These sources offer a high-level overview and are useful
for getting general information or an introduction to a topic. They do not contain
raw data themselves but point to other sources.
Characteristics of Tertiary Data:
•Summarizes or compiles information from primary and secondary sources.
•Provides high-level overviews or references rather than detailed analysis.
•Often used for general information or quick reference.
Examples of Tertiary Data Sources:
•Encyclopedias: Provide summaries of topics, often with references to primary
and secondary sources for more in-depth information.
o Example: Britannica Online or Wikipedia.
•Dictionaries and Thesauruses: Provide definitions or synonyms for terms and
concepts.
o Example: Merriam-Webster Dictionary.
•Bibliographies and Indexes: Lists or indexes of other sources (books, journals,
articles) that are relevant to a particular subject or field.
o Example: Indexes in academic journals that list relevant articles by subject.
•Almanacs: Provide factual data, statistics, or summaries of information for a
specific year, often compiled from other sources.
o Example: World Almanac, which provides data on global statistics and events.
4. Open Data Sources
Open data refers to data that is freely available to the public and can be used,
modified, and shared without restriction. Open data sources are usually
government, non-profit, or academic repositories that provide data for research,
policy-making, and transparency.
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=nX_Xp2hVc0s&pp=ygUkQWNxdWlzaXRpb24gb2YgRGF0YSBpbiBlbmdsaXNoIG5wdGVs
1. Challenges in Data Storage and Retrieval
1.2 Retrieval Challenges
•Query Performance:
o Slow query execution due to inefficient indexing or large data volumes.
•Complex Queries:
o Handling complex queries involving multiple joins or aggregations.
•Concurrency:
o Managing simultaneous data access requests without conflicts or
bottlenecks.
•Heterogeneous Data:
o Retrieving and integrating data from different formats and sources.
Solutions:
•Applications:
o Real-Time Systems: Used in industries where milliseconds matter, such
as financial trading platforms, where orders need to be executed instantly.
o Data Analytics: Enables real-time dashboards and analytics by providing
immediate access to large datasets.
o Gaming: Powers multiplayer online games where player actions and
updates must be synchronized instantly.
•Examples:
o Redis: A widely used key-value store database that excels in speed and
supports complex data structures like lists, sets, and sorted sets.
o SAP HANA: Integrates in-memory storage with advanced analytics
capabilities for enterprise applications.
2. Retrieval and Query Languages
Retrieving data efficiently is a core function of database systems. Query
languages, particularly SQL, play a vital role in accessing, manipulating, and
analyzing data. The performance of data retrieval depends significantly on how
queries are written, optimized, and executed.
2.1 SQL (Structured Query Language)
SQL is the most commonly used query language for relational databases. It
provides a standardized way to communicate with databases.
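As a minimal illustration of issuing SQL from a program, the sketch below uses Python's built-in sqlite3 module; the employees table and its columns are hypothetical examples, not part of the course material.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# A hypothetical employees table used only for illustration.
cur.execute("CREATE TABLE employees (employee_id INTEGER, name TEXT, dept TEXT)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", "Sales"), (2, "Ravi", "HR"), (3, "Meera", "Sales")],
)

# A standard SQL query: select and filter rows.
cur.execute("SELECT name FROM employees WHERE dept = ?", ("Sales",))
print(cur.fetchall())  # -> [('Asha',), ('Meera',)]
conn.close()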
•Indexing:
o Speeds up data retrieval by maintaining a data structure that maps column
values to the rows that contain them.
Example: Indexing the "employee_id" column in an employee table allows
quick lookups (see the sketch after this list).
o Types of Indexes:
B-Tree Index: Efficient for range queries and equality searches.
Hash Index: Optimized for exact matches.
•Partitioning:
o Divides large datasets into smaller, manageable parts based on specific
criteria (e.g., date, region).
o Enables faster retrieval by narrowing down the search scope to a specific
partition.
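The sketch below, again using SQLite for self-containment, indexes the "employee_id" column mentioned above and inspects the query plan before and after; SQLite builds a B-tree index by default.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (employee_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [(i, f"emp{i}") for i in range(10_000)])

def plan(sql):
    # Ask SQLite how it would execute the query.
    return cur.execute("EXPLAIN QUERY PLAN " + sql).fetchall()

query = "SELECT name FROM employee WHERE employee_id = 4321"
print(plan(query))  # full table SCAN before the index exists

cur.execute("CREATE INDEX idx_emp_id ON employee(employee_id)")  # B-tree index
print(plan(query))  # now a SEARCH using idx_emp_id
conn.close()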
2.2 Advanced Querying Mechanisms
•Query Optimization:
o Query optimizers in DBMS automatically choose the most efficient way to
execute a query, considering factors like table size, indexes, and join
strategies.
o Example: Rewriting a query to use indexed columns instead of scanning the
entire table.
•NoSQL Query Languages:
o Designed for unstructured and semi-structured data stored in NoSQL
databases.
•Graph Query Languages:
o Used in graph databases like Neo4j, which store data as nodes and edges.
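As a hedged illustration of a NoSQL query language, the sketch below runs a document-store filter with the pymongo driver; it assumes a local MongoDB server, and the collection and field names are invented for the example.

from pymongo import MongoClient

# Assumes a MongoDB server on localhost (an assumption made for this sketch).
client = MongoClient("mongodb://localhost:27017/")
orders = client["shop"]["orders"]

orders.insert_many([
    {"customer": "alice", "total": 120, "region": "north"},
    {"customer": "bob", "total": 45, "region": "south"},
])

# Query by example: JSON-like filters instead of SQL.
for doc in orders.find({"total": {"$gt": 100}}):
    print(doc["customer"], doc["total"])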
3. Integration of Solutions
Combining in-memory storage with advanced retrieval methods ensures that
modern applications can handle both high performance and complex querying
needs:
•Hybrid Systems:
o Many DBMSs allow for hybrid setups where frequently accessed data is
stored in memory while less critical data resides on disk.
o Example: MySQL's InnoDB engine uses a buffer pool to store frequently
accessed pages in memory.
•Caching:
o Frequently queried data is cached in memory to reduce retrieval times.
Tools like Redis and Memcached act as in-memory layers above
traditional databases.
o Example: Storing results of a product catalog query in memory for quicker
responses to user searches.
•Streamlining Analytics:
o Combining in-memory storage with SQL-based analytical tools allows real-
time insights, crucial for decision-making in businesses.
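The caching point above can be sketched as the classic cache-aside pattern. In the minimal example below a plain Python dict stands in for Redis or Memcached so the code runs without an external server; the product data is invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'keyboard'), (2, 'monitor')")

cache = {}  # in production this would be Redis or Memcached

def get_product(product_id):
    if product_id in cache:      # cache hit: no database access at all
        return cache[product_id]
    row = conn.execute(
        "SELECT name FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    cache[product_id] = row[0]   # populate the cache for next time
    return row[0]

print(get_product(1))  # miss -> reads from the database
print(get_product(1))  # hit  -> served from memory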
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=8suRVB5h5-w&pp=ygUaRGF0YWJhc2UgSXNzdWVzIGluIGVuZ2xpc2g%3D
THANK YOU
For queries
Email: [email protected]
UNIVERSITY INSTITUTE OF
ENGINEERING
COMPUTER SCIENCE ENGINEERING
Data Visualization
(CSH-461)
Linear regression models the relationship between the dependent variable Y and
one or more independent variables X. It assumes a linear relationship:
Y = β0 + β1X + ϵ
o Y: Dependent variable (output).
o X: Independent variable (input).
o β0: Intercept of the line.
o β1: Slope of the line.
o ϵ: Error term (captures the noise).
•Applications:
o Predicting housing prices based on square footage, location, and amenities.
o Estimating annual revenue for a company.
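A small worked sketch of the model above, using NumPy's least-squares fit; the square-footage and price figures are invented purely for illustration.

import numpy as np

X = np.array([800, 1000, 1200, 1500, 1800])  # square footage
Y = np.array([120, 150, 170, 210, 240])      # price in thousands

# Fit Y = b0 + b1*X by ordinary least squares (polyfit returns [slope, intercept]).
b1, b0 = np.polyfit(X, Y, deg=1)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.3f}")

# Predict the price of a hypothetical 1,300 sq-ft house.
print("predicted:", b0 + b1 * 1300)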
1.2 Polynomial Regression
Polynomial regression is an extension of linear regression for capturing non-linear
relationships.
The model fits a polynomial equation:
Y = β0 + β1X + β2X² + … + βnXⁿ + ϵ
•Applications:
o Modelling growth trends, such as population over time.
o Predicting temperature changes in weather forecasting.
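A corresponding sketch for a degree-2 polynomial fit; the population figures are illustrative, not real data.

import numpy as np

years = np.array([0, 1, 2, 3, 4, 5])
population = np.array([10.0, 10.8, 12.1, 14.0, 16.6, 19.8])  # in millions

coeffs = np.polyfit(years, population, deg=2)  # returns [b2, b1, b0]
model = np.poly1d(coeffs)

print("fitted polynomial:\n", model)
print("forecast for year 6:", model(6))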
2. Discontinuous Variables
Discontinuous variables, also known as categorical variables, are variables that
take on distinct, separate values. They may represent categories, groups, or
binary states. Examples include gender, job type, or product categories.
Prediction of Discontinuous Variables
Predicting discontinuous variables is referred to as classification when the output
is purely categorical, and as ordinal regression when the categories have a natural order.
2.1 Logistic Regression
Logistic regression predicts the probability of a binary outcome.
The model transforms a linear equation using the sigmoid function:
P(Y = 1 | X) = 1 / (1 + e^−(β0 + β1X))
o Outputs probabilities that are mapped to categories (e.g., 0 or 1).
•Applications:
o Predicting customer churn (Yes/No).
o Classifying tumours as benign or malignant.
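A minimal sketch of logistic regression for the churn example, using scikit-learn; the toy feature (months since last purchase) and the labels are invented.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature: months since last purchase; label: 1 = churned, 0 = stayed.
X = np.array([[1], [2], [3], [6], [8], [10], [12], [15]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# The sigmoid maps the linear score β0 + β1X to a probability in (0, 1).
print(model.predict_proba([[4]]))  # [P(stay), P(churn)] for 4 months
print(model.predict([[14]]))       # predicted class label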
2.2 Decision Trees
Decision trees predict a categorical outcome by repeatedly splitting the data on feature values:
o Nodes represent feature-based decisions.
o Leaves represent class labels.
o Splitting is based on measures like Gini Index or Information Gain.
•Applications:
o Predicting loan approval.
o Diagnosing diseases based on symptoms.
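A minimal decision-tree sketch for the loan-approval example, using scikit-learn; the features, labels, and threshold values are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [annual income, number of existing loans]; label: 1 = approve.
X = [[12, 0], [5, 2], [20, 1], [3, 3], [15, 0], [7, 1]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)

print(export_text(tree, feature_names=["income", "loans"]))  # the learned splits
print(tree.predict([[10, 1]]))  # predicted class for a new applicant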
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=fWOV6n9nv7c
THANK YOU
For queries
Email: [email protected]
UNIVERSITY INSTITUTE OF
ENGINEERING
COMPUTER SCIENCE ENGINEERING
Data Visualization
(CSH-461)
One of the key benefits of data visualization is its ability to simplify large
datasets. Instead of trying to interpret rows upon rows of numbers or text,
visualizations provide a clear and concise summary that can be understood at a
glance. This not only saves time but also reduces the risk of errors or
misunderstandings.
1. Bar Charts: A classic technique, bar charts are effective for comparing
different categories or displaying changes over time. They use vertical or
horizontal bars to represent data values.
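A minimal bar-chart sketch with matplotlib; the categories and sales figures are invented.

import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]
sales = [240, 180, 310, 205]

plt.bar(categories, sales)  # vertical bars; plt.barh() gives horizontal ones
plt.ylabel("Sales (units)")
plt.title("Sales by region")
plt.show()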
2. A Box and Whisker Plot
A box and whisker plot, or box plot, provides a visual summary of data through its
quartiles. First, a box is drawn from the first quartile to the third of the data set. A
line within the box represents the median. “Whiskers,” or lines, are then drawn
extending from the box to the minimum (lower extreme) and maximum (upper
extreme). Outliers are represented by individual points that are in-line with the
whiskers.
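A minimal box-and-whisker sketch with matplotlib, using randomly generated data for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = [rng.normal(50, 10, 200), rng.normal(60, 15, 200)]

plt.boxplot(samples)  # box = quartiles, line = median, points = outliers
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("Value")
plt.title("Box and whisker plot")
plt.show()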
3. Waterfall Chart
A waterfall chart is a visual representation that illustrates how a value changes as
it’s influenced by different factors, such as time. The main goal of this chart is to
show the viewer how a value has grown or declined over a defined period. For
example, waterfall charts are popular for showing spending or earnings over time.
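Matplotlib has no built-in waterfall type, but one can be sketched from bar offsets, as below; the monthly cash-flow figures are invented.

import numpy as np
import matplotlib.pyplot as plt

labels = ["Start", "Jan", "Feb", "Mar", "Apr"]
changes = np.array([100, 30, -20, 45, -10])  # first value is the opening balance

bottoms = np.concatenate(([0], np.cumsum(changes)[:-1]))  # where each bar starts
colors = ["grey"] + ["green" if c >= 0 else "red" for c in changes[1:]]

plt.bar(labels, changes, bottom=bottoms, color=colors)
plt.ylabel("Cash flow")
plt.title("Waterfall chart of monthly changes")
plt.show()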
4. Area Chart
An area chart, or area graph, is a variation on a basic line graph in which the area
underneath the line is shaded to represent the total value of each data point.
When several data series must be compared on the same graph, stacked area
charts are used. This method of data visualization is useful for showing changes in
one or more quantities over time, as well as showing how each quantity combines
to make up the whole. Stacked area charts are effective in showing part-to-whole
comparisons.
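A minimal stacked-area sketch with matplotlib; the quarterly revenue figures are invented.

import matplotlib.pyplot as plt

quarters = [1, 2, 3, 4]
product_a = [10, 14, 18, 22]
product_b = [5, 7, 9, 12]

# Each series is stacked on the previous one, showing part-to-whole over time.
plt.stackplot(quarters, product_a, product_b, labels=["Product A", "Product B"])
plt.xlabel("Quarter")
plt.ylabel("Revenue")
plt.legend(loc="upper left")
plt.title("Stacked area chart")
plt.show()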
5. Pictogram Chart
Pictogram charts, or pictograph charts, are particularly useful for presenting
simple data in a more visual and engaging way. These charts use icons to
visualize data, with each icon representing a different value or category. For
example, data about time might be represented by icons of clocks or watches.
Each icon can correspond to either a single unit or a set number of units (for
example, each icon represents 100 units).
6. Highlight Table
A highlight table is a more engaging alternative to traditional tables. By
highlighting cells in the table with color, you can make it easier for viewers to
quickly spot trends and patterns in the data. These visualizations are useful for
comparing categorical data. Depending on the data visualization tool you’re
using, you may be able to add conditional formatting rules to the table that
automatically color cells that meet specified conditions. For instance, when using
a highlight table to visualize a company’s sales data, you may color cells red if
the sales data is below the goal, or green if sales were above the goal. Unlike a
heat map, the colors in a highlight table are discrete and represent a single
meaning or value.
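One way to build such a highlight table in code is with the pandas Styler, sketched below; the goal value and sales figures are invented, and the red/green rule mirrors the example in the text.

import pandas as pd

goal = 100
sales = pd.DataFrame(
    {"Q1": [120, 80, 95], "Q2": [110, 130, 70]},
    index=["North", "South", "East"],
)

def highlight(value):
    # Discrete colours with a single meaning each: green = goal met, red = below goal.
    return "background-color: #c6efce" if value >= goal else "background-color: #ffc7ce"

styled = sales.style.applymap(highlight)
styled.to_html("highlight_table.html")  # open in a browser or render in a notebook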
References:
Ben Fry, Visualizing Data: Exploring and Explaining Data with Processing Environment,
O'Reilly Media.
Video Link:
https://www.youtube.com/watch?v=4lxA7lo9GLU&pp=ygUnVGVjaG5pcXVlcyBmb3IgcGxvdHRpbmcgZGF0YSBpbiBlbmdsaXNo
THANK YOU
For queries
Email: [email protected]
UNIVERSITY INSTITUTE OF
ENGINEERING
COMPUTER SCIENCE ENGINEERING
Data Visualization
(CSH-461)
•Numerical Data:
o Suitability: Ideal for performing mathematical operations, statistical
analysis, and creating models.
o Examples: Financial data analysis, scientific experiments, engineering
measurements.
o Visualization Techniques:
Line Charts: Effective for showing trends over time or continuous data.
Scatter Plots: Useful for showing relationships between two numerical
variables.
Histograms: Great for displaying the distribution of a dataset.
•Categorical Data:
o Suitability: Best for grouping and categorizing information for comparison.
o Examples: Market segmentation, customer feedback analysis,
demographic studies.
o Visualization Techniques:
Bar Charts: Excellent for comparing different categories.
Pie Charts: Useful for showing proportions and parts of a whole.
•Time-Series Data:
o Suitability: Crucial for analysing and forecasting trends and patterns over
time.
o Examples: Economic indicators, climate data, sales performance.
o Visualization Techniques:
Time-Series Plots: Ideal for showing data points over time.
Line Graphs: Great for visualizing trends and changes over periods.
Area Charts: Useful for showing cumulative data over time.
•Spatial Data:
o Suitability: Best for analysing information tied to geographic locations.
o Examples: Regional sales, store locations, delivery routes.
o Visualization Techniques:
Maps (choropleth or point maps): Ideal for revealing geographic patterns.
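The pairings above can be sketched in one figure: a histogram for numerical data, a pie chart for categorical proportions, and a line plot for a time series (all values below are invented; spatial data would normally call for map-based plots and extra libraries).

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(rng.normal(50, 10, 500), bins=20)      # numerical: distribution
axes[0].set_title("Histogram (numerical)")

axes[1].pie([45, 30, 25], labels=["A", "B", "C"])   # categorical: proportions
axes[1].set_title("Pie chart (categorical)")

axes[2].plot(range(12), rng.integers(80, 120, 12))  # time series: monthly values
axes[2].set_title("Line plot (time series)")

plt.tight_layout()
plt.show()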
Key factors to consider when choosing how to store, process, and visualize these data types include:
•Data Quality: Ensuring the accuracy, completeness, and reliability of the data.
•Data Format: The structure and format of data, affecting ease of processing and
analysis.
•Analysis Requirements: Specific needs of the analysis, such as level of detail
and precision.
•Visualization Needs: Choosing the best way to visually represent data to
communicate insights.
•Scalability: The ability to handle and process large volumes of data efficiently.
4. Choosing the Right Tool:
Video Link:
https://www.youtube.com/watch?v=4lxA7lo9GLU&pp=ygUnVGVjaG5pcXVlcyBmb3IgcGxvdHRpbmcgZGF0YSBpbiBlbmdsaXNo
THANK YOU
For queries
Email: [email protected]