DVT Unit 2

The document discusses various data visualization techniques, including Word Clouds and Time Series Data, emphasizing their applications and limitations. It outlines the steps for Exploratory Data Analysis (EDA) and data collection methods, such as Single and Multiple Sources, along with web scraping for automated data collection. Additionally, it covers data cleaning, aggregation, and the importance of aesthetics in visual representation, highlighting how to effectively map data onto visual elements.


Word Clouds

Also known as a Tag Cloud.

A visualisation method that displays how frequently words appear in a given body of text, by making the size of
each word proportional to its frequency. All the words are then arranged in a cluster or cloud of words.
Alternatively, the words can also be arranged in any format: horizontal lines, columns or within a shape.

Word Clouds can also be used to display words that have meta-data assigned to them. For example, in a
Word Cloud of all the World's countries, the population could be assigned to each country's name to determine
its size.

Colour used on Word Clouds is usually meaningless and is primarily aesthetic, but it can be used to categorise
words or to display another data variable.
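The frequency-to-size mapping at the heart of a Word Cloud can be sketched in a few lines of Python. The sample text and the point-size range below are arbitrary illustrations:

```python
from collections import Counter

text = ("data helps decisions data drives insight "
        "insight guides decisions data")
freq = Counter(text.split())

# Map each word's frequency onto a font size between min_pt and max_pt,
# so the most frequent word gets the largest type.
min_pt, max_pt = 10, 48
lo, hi = min(freq.values()), max(freq.values())

def font_size(count):
    if hi == lo:                       # all words equally frequent
        return float(max_pt)
    return min_pt + (count - lo) * (max_pt - min_pt) / (hi - lo)

for word, count in freq.most_common():
    print(f"{word}: count={count}, size={font_size(count):.0f}pt")
```

A real layout engine would additionally pack the sized words into a cloud shape, but the size computation itself is just this linear mapping.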

Typically, Word Clouds are used on websites or blogs to depict keyword or tag usage. They can also
be used to compare two different bodies of text.

Although simple and easy to understand, Word Clouds have some major flaws:
-​ Long words are emphasised over short words.
-​ Words whose letters contain many ascenders and descenders may receive more attention.
-​ They're not great for analytical accuracy, so used more for aesthetic reasons instead.
Time Series Data
Time series data is information collected in sequence over time. It shows how things change at different points,
like stock prices every day or temperature every hour.

-​ It is used in industries such as finance, pharmaceuticals, social media, and research.


-​ Analyzing and visualizing this data helps us find trends, seasonal patterns, and behaviors.
-​ These insights support forecasting and guide better decision-making.
-​ The main goal is to study data in time order to extract meaningful patterns and predictions.

Exploratory Data Analysis (EDA)
Steps
1.​ Data understanding - identify the data types involved and why they matter
2.​ Data Collection
3.​ Data Cleaning - Handling Missing values and Inconsistencies
4.​ Visualization - univariate, bivariate, multivariate
5.​ Handling Outliers
6.​ Findings and insights

Types
1.​ Univariate
2.​ Bivariate
3.​ Multivariate
UNIT 2

DATA COLLECTION

1. Single Source Data Collection


What It Means
Single source means you're pulling all your data from one place - one database, one spreadsheet, one API, or
one system.
Example in Practice
Imagine you work for an e-commerce company and you need to analyze sales performance. You extract all
your data directly from the company's sales database - customer purchases, transaction amounts, dates,
product categories, etc. Everything comes from that one database.

Advantages - Why Use Single Source?


Easier Integration: Since everything comes from one place, you don't need to worry about combining different
data formats or structures. It's like using LEGO blocks from the same set - they all fit together perfectly.
Easier Cleaning: The data typically follows consistent formatting rules, has the same data types, and uses the
same conventions. For example, dates will all be in the same format (MM/DD/YYYY or DD/MM/YYYY),
currency will be consistent, etc.

Limitations - The Downside


Lack of Context: Your data might tell you WHAT happened but not WHY. For instance, your sales data shows a
spike in December, but without external data (like marketing campaign data or seasonal trends), you can't
explain why.
Lack of Completeness: You're limited to what that one source tracks. Your sales database might not include
customer sentiment, competitor pricing, economic indicators, or other factors that influence sales.

2. Multiple Sources Data Collection


What It Means
Multiple sources means you're combining data from different places to create a more complete picture. You're
essentially connecting different puzzle pieces from different boxes.
Example in Practice
An agricultural researcher wants to understand crop yields. They combine:

Weather data (temperature, rainfall, humidity) from meteorological services


Crop yield data (harvest quantities, quality grades) from farms
Soil data (pH levels, nutrients) from agricultural databases

By merging these, they can analyze how weather patterns affect crop production.

Advantages - The Power of Multiple Sources


Richer Insights: You can see relationships and patterns that wouldn't be visible with just one dataset. In the
agricultural example, you can correlate rainfall amounts with yield quality.
More Comprehensive: You get a 360-degree view of your subject. Instead of just knowing sales numbers, you
might combine them with customer reviews, social media sentiment, and competitor data.

Challenges - The Complications


Data Integration: Different sources use different formats. One dataset might have dates as "01/15/2024" while
another uses "2024-01-15" or "January 15, 2024". You need to standardize everything.
Matching Formats: Beyond dates, you might have different units (kilometers vs. miles), different naming
conventions (USA vs. United States), or different levels of granularity (daily vs. monthly data).
Handling Inconsistencies: Sources might contradict each other, have different update frequencies, or measure
things differently. For example, one source might count "active users" as anyone who logged in that month,
while another counts only those who made a purchase.
Web Scraping - Automated Data Collection
What Is Web Scraping?

Web scraping is like having a robot assistant that visits websites and copies information for you automatically.
Instead of manually copying product prices or article headlines, you write code that does it for you at scale.

Tools for Web Scraping

Python Libraries - The three most popular:

1.​ BeautifulSoup: Best for parsing HTML/XML. It's like a smart search tool that finds specific elements on
a webpage (like all product names or prices).
2.​ Selenium: Controls a real web browser automatically. It can click buttons, fill forms, and handle
websites that require user interaction. Think of it as a robot clicking through a website just like you
would.
3.​ Scrapy: A complete framework for large-scale scraping. It's like BeautifulSoup on steroids - designed
for crawling entire websites efficiently.
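BeautifulSoup is the usual choice for this kind of parsing. As a dependency-free sketch of the same idea, Python's built-in html.parser can pull specific elements out of a page; the HTML snippet and class names below are made up:

```python
from html.parser import HTMLParser

# Hypothetical product-listing snippet; in real scraping this HTML
# would come from an HTTP response.
HTML = """
<ul>
  <li class="product"><span class="name">Laptop</span><span class="price">$999</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">$25</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collects the text of every element whose class is 'price'."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data)
            self.in_price = False

parser = PriceParser()
parser.feed(HTML)
print(parser.prices)  # ['$999', '$25']
```

BeautifulSoup does the same job with far less boilerplate (e.g. selecting all `.price` elements in one call), which is why it is the standard tool.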

Real-World Applications

E-commerce Price Monitoring: Scraping Amazon, Flipkart, or other sites to track product prices over time.
You could visualize price trends, find the best time to buy, or compare competitor pricing.

News Aggregation: Extracting articles from multiple news portals to analyze trending topics, sentiment, or
media coverage patterns.

Review Analysis: Gathering customer reviews and ratings from multiple platforms to understand product
sentiment, identify common complaints, or track brand reputation.

Important Considerations - The Rules of the Game

Ethical/Legal Boundaries: Not all websites allow scraping. Some explicitly prohibit it in their Terms of Service.
Always check before scraping. Just because you can access data doesn't mean you should.

robots.txt: This is a file websites use to tell crawlers which parts of the site they may access. For example,
amazon.com/robots.txt lists what Amazon allows. Respecting it is the ethical choice and reduces legal risk.
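Python's standard library can check robots.txt rules via urllib.robotparser. The rules below are an invented example, parsed directly from text so no network request is needed:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, not any real site's policy.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Check paths before scraping them.
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
print(rp.can_fetch("*", "https://example.com/products"))      # True
```

In a real scraper you would call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the live file, then gate every request on `can_fetch`.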

Server Load: If you send too many requests too quickly, you can overwhelm a website's servers (similar to a
DDoS attack). Always add delays between requests and be respectful of the website's resources.

Dynamic Content Handling: Modern websites load content with JavaScript. If you scrape the initial HTML,
you might get an empty page. This is where tools like Selenium become necessary - they wait for the
JavaScript to execute and content to load.
Data Cleaning - Removing Noise and Errors
1. Handling Missing Values

The Problem: Real-world data often has gaps. A survey respondent skips a question, a sensor fails, or a
database field is left blank.

Two Main Approaches:

Drop: Simply remove rows or columns with missing data. Use this when you have plenty of data and the
missing values are random.

●​ Example: If only 5 out of 10,000 customer records lack email addresses, just drop those 5.

Impute: Fill in missing values intelligently.

●​ Mean/Median imputation: Replace missing ages with the average age


●​ Forward fill: Use the last known value (good for time series)
●​ Predictive imputation: Use other columns to predict the missing value

Example: If you're visualizing temperature trends and one day's reading is missing, you might impute it as the
average of the day before and after.
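The temperature example above can be sketched as a small imputation helper in pure Python (no pandas; the readings are invented):

```python
def impute_neighbors(readings):
    """Fill each None with the average of the nearest known values on
    each side; at the ends, use the single available neighbour."""
    filled = list(readings)
    for i, v in enumerate(filled):
        if v is None:
            prev = next((filled[j] for j in range(i - 1, -1, -1)
                         if filled[j] is not None), None)
            nxt = next((filled[j] for j in range(i + 1, len(filled))
                        if filled[j] is not None), None)
            candidates = [x for x in (prev, nxt) if x is not None]
            filled[i] = sum(candidates) / len(candidates)
    return filled

temps = [21.0, 22.0, None, 24.0, None]
print(impute_neighbors(temps))  # [21.0, 22.0, 23.0, 24.0, 24.0]
```

This is the time-series flavour of imputation; for non-sequential data, replacing with the column mean or median is the usual equivalent.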

2. Removing Duplicates

The Problem: The same record appears multiple times, which skews your analysis and visualizations.

Example: A customer accidentally submitted the same order twice, or data was imported multiple times. If you
visualize total sales without removing duplicates, your revenue chart will be inflated.

Solution: Identify duplicate rows based on key fields (like customer ID + timestamp) and keep only one copy.
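A minimal sketch of key-based de-duplication, assuming records are dictionaries and keeping the first copy of each (customer ID, timestamp) pair:

```python
def drop_duplicates(records, key_fields):
    """Keep only the first record for each unique key combination."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

orders = [
    {"customer_id": 7, "timestamp": "2024-01-15T10:00", "amount": 50},
    {"customer_id": 7, "timestamp": "2024-01-15T10:00", "amount": 50},  # duplicate
    {"customer_id": 8, "timestamp": "2024-01-15T10:05", "amount": 30},
]
clean = drop_duplicates(orders, ["customer_id", "timestamp"])
print(len(clean))  # 2
```

Choosing the key fields is the important decision: too few fields and you drop legitimate repeat orders; too many and true duplicates slip through.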

3. Correcting Inconsistent Formats

The Problem: The same information is represented differently, making it impossible to group or compare.

Common Examples:

Dates:

●​ "01/15/2024" vs "2024-01-15" vs "15-Jan-24" vs "January 15, 2024"


●​ Solution: Standardize to one format, like ISO 8601 (YYYY-MM-DD)

Currency:

●​ "$1,500.00" vs "1500 USD" vs "₹1,12,500"


●​ Solution: Convert to numeric values with a consistent currency

Units:

●​ "5 km" vs "5000 m" vs "3.1 miles"


●​ Solution: Convert all to the same unit

Why It Matters for Visualization: If you're creating a timeline chart and your dates are in different formats, the
system won't recognize them as dates and your chart will break.
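The date case above can be sketched as a try-each-format converter that emits ISO 8601. The format list is illustrative, and note one caveat the code cannot solve: a date like 05/04/2024 is ambiguous between MM/DD and DD/MM unless you know the source's convention.

```python
from datetime import datetime

# Candidate formats seen in the text above; extend as new sources appear.
KNOWN_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d-%b-%y", "%B %d, %Y"]

def to_iso(date_str):
    """Convert a date in any known format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"Unrecognised date format: {date_str!r}")

for raw in ["01/15/2024", "2024-01-15", "15-Jan-24", "January 15, 2024"]:
    print(to_iso(raw))  # each prints 2024-01-15
```

The same try-parse-normalise pattern works for currencies and units: strip symbols, parse the number, convert to the canonical unit.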
Data Aggregation - Combining Data at Different Levels
What Is Aggregation?

Aggregation means summarizing detailed data into higher-level summaries. It's like zooming out to see the
bigger picture instead of individual details.

1. Summarization - Rolling Up Data

Example: Daily to Monthly Sales

●​ Raw data: Sales for every single day (Jan 1: $500, Jan 2: $750, Jan 3: $600...)
●​ Aggregated: Total sales for January: $45,000

Why Do This?

●​ Reduces Complexity: A chart with 365 daily points is cluttered; 12 monthly points is clearer
●​ Makes Patterns Visible: Daily fluctuations might hide the overall upward trend that becomes obvious
monthly
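Rolling daily figures up to monthly totals can be sketched with a plain dictionary; the sales figures reuse the example above:

```python
from collections import defaultdict

daily_sales = [
    ("2024-01-01", 500), ("2024-01-02", 750), ("2024-01-03", 600),
    ("2024-02-01", 900), ("2024-02-02", 400),
]

monthly = defaultdict(int)
for date, amount in daily_sales:
    monthly[date[:7]] += amount       # "YYYY-MM" prefix is the month key

print(dict(monthly))  # {'2024-01': 1850, '2024-02': 1300}
```

With pandas the same roll-up is a one-liner (`df.resample("ME").sum()` on a datetime index), but the grouping logic is identical.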

2. Grouping - Comparing Categories

Example: Average Test Scores per Department

●​ Raw data: Individual student scores (Student A: 85, Student B: 92, Student C: 78...)
●​ Grouped: Computer Science avg: 87, Mathematics avg: 82, Physics avg: 90

Why Do This? Enables comparison between categories. You can create a bar chart comparing departments,
which would be impossible with individual student data.

Benefits of Aggregation

Reduces Complexity: Instead of visualizing 1 million individual transactions, you show monthly totals - much
more digestible.

Makes Patterns Visible: Noise in individual data points can obscure trends. Aggregation smooths out random
variations and reveals the underlying patterns.

Performance: Visualizing aggregated data is faster and uses less memory than raw, detailed data.

The Pre-processing Pipeline


Here's how these steps typically flow:

1.​ Collect raw data (via web scraping or other methods)


2.​ Clean it (handle missing values, remove duplicates, fix formats)
3.​ Aggregate it (summarize or group as needed)
4.​ Visualize it (now your data is ready for charts and graphs!)
Practical Example - Putting It All Together
Let's say you scraped product reviews from an e-commerce site:

Raw Data Issues:

●​ Some reviews missing star ratings


●​ Duplicate reviews from bots
●​ Dates in different formats
●​ Prices in different currencies

Pre-processing Steps:

1.​ Impute missing ratings with the product average


2.​ Remove duplicate reviews based on user ID + timestamp
3.​ Standardize all dates to YYYY-MM-DD
4.​ Convert all prices to USD
5.​ Aggregate reviews by month to see sentiment trends

Now you can visualize: A line chart showing average monthly ratings over time, revealing that ratings
dropped after a price increase.

3. Mapping Data onto Aesthetics


The Core Concept

"Aesthetics" in data visualization doesn't mean "making things pretty" - it refers to visual properties that your
eyes can perceive and your brain can interpret. Think of aesthetics as the visual language you use to
communicate your data.

The Basic Idea: You're translating numbers and categories into visual elements that humans can understand
at a glance.

Aesthetics and Types of Data - The Right Tool for the Right Job
Let me explain each aesthetic and when to use it:

1. Position - The Most Powerful Aesthetic

Best for: Quantitative and ordinal data

Why it works: Our brains are extremely good at judging position along a scale. We can easily compare "this
point is higher than that point."

Examples:

●​ Scatter plot: X-axis = study hours, Y-axis = exam scores. Position tells you both values
simultaneously.
●​ Bar chart: Height of bars represents sales revenue
●​ Line chart: Position over time shows stock prices

Data Types:

●​ Quantitative: Temperature (0°C, 15°C, 30°C) - actual measurable values


●​ Ordinal: Education level (High School < Bachelor's < Master's < PhD) - ordered categories

2. Color - The Versatile Communicator

Two Different Uses:

For Categorical Data (Nominal/Ordinal): Different hues for different categories

●​ Example: Red = Product A, Blue = Product B, Green = Product C


●​ Each product line gets its own color in a multi-line chart
●​ Political maps: Red states vs Blue states

For Continuous Data: Gradients (color intensity)

●​ Example: Heat map showing temperature - light yellow (cool) → orange → dark red (hot)
●​ Darker/more intense colors = higher values
●​ Like a weather map showing temperature gradients

Pro Tip: Use color thoughtfully - too many colors can be overwhelming, and about 8% of men have color
blindness.

3. Size - Showing Magnitude

Best for: Quantitative data (showing "how much")

Examples:

●​ Bubble chart: Bubble size represents population or market size


●​ Each country is a bubble; larger bubbles = larger populations
●​ Scatter plot with sized points: Size shows sales volume while position shows price and profit

Why it works: Bigger = more is intuitive to everyone

Caution: Our brains judge area, not diameter. A circle with 2× the diameter has 4× the area, which can be
misleading if bubble sizes are not scaled carefully.

4. Shape - Distinguishing Categories

Best for: Nominal data (categories with no inherent order)

Examples:

●​ Scatter plot: Circles = Male customers, Triangles = Female customers, Squares = Non-binary
●​ Line chart: Solid line = Actual sales, Dashed line = Forecasted sales, Dotted line = Target

Limitation: Humans can only distinguish about 5-6 different shapes reliably. Beyond that, it becomes
confusing.

Best Practice: Often combined with color for redundancy (circle + blue, triangle + red), which helps colorblind
users.
5. Line/Area - Showing Change and Accumulation

Best for:

●​ Time series (changes over time)


●​ Aggregation (showing totals or cumulative values)

Line Examples:

●​ Stock price over time


●​ Temperature throughout the day
●​ Website traffic month by month

Area Examples:

●​ Stacked area chart: Show how different product categories contribute to total revenue over time
●​ Area under curve: Emphasizes the magnitude of change, not just the trend

Why lines for time: Our eyes naturally follow lines to see progression and trends. The slope tells us the rate
of change.

Scales - The Translation Mechanism


What Are Scales?

A scale is the bridge between your data values and visual properties. It's the function that converts "45
degrees Celsius" into "this specific position on the chart" or "Category A" into "blue color."

Types of Scales

Continuous Scale: Maps numerical ranges to visual ranges

●​ Example: Temperature data from 0°C to 100°C maps to a color gradient from blue (cold) to red (hot)
●​ Income from $0 to $200,000 maps to bar heights from 0 pixels to 400 pixels

Categorical Scale: Assigns distinct visual properties to distinct categories

●​ Example: Product categories (Electronics, Clothing, Food) map to distinct colors (Blue, Green,
Orange)
●​ Countries map to distinct shapes on a scatter plot
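A continuous scale is just a function from the data domain to the visual range. A minimal linear version, using the income-to-pixels mapping from the example above:

```python
def linear_scale(value, domain, range_):
    """Map a data value from its domain to a visual range,
    e.g. dollars to pixel heights."""
    d0, d1 = domain
    r0, r1 = range_
    return r0 + (value - d0) * (r1 - r0) / (d1 - d0)

# Income $0-$200,000 maps to bar heights 0-400 pixels.
print(linear_scale(50_000, (0, 200_000), (0, 400)))   # 100.0
print(linear_scale(200_000, (0, 200_000), (0, 400)))  # 400.0
```

Charting libraries implement exactly this idea (plus categorical and logarithmic variants); D3, for instance, calls them `scaleLinear`, `scaleOrdinal`, and `scaleLog`.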

Why Scales Matter

Example of Good Scale Use: Imagine visualizing global temperatures. If you use a continuous color scale
from light blue (cold) through white (moderate) to dark red (hot), viewers immediately understand the pattern.

Example of Bad Scale Use: If you used random colors (purple = cold, yellow = medium, green = hot), it would
be confusing because these colors don't have intuitive temperature associations.

Avoiding Misinterpretation:

●​ If your data ranges from 95 to 100, but your axis starts at 0, the differences might look tiny
●​ If your axis starts at 95, the same differences look dramatic
●​ Both are technically correct, but tell very different stories - your scale choice matters ethically

4. Coordinate Systems and Axes


What Are Coordinate Systems?

Think of coordinate systems as the graph paper on which you draw your data. They provide the framework
that gives context to positions.

Cartesian Coordinates - The Default Choice


What It Is

The familiar X and Y axes at right angles - the graph system you learned in school.

Structure:

●​ X-axis (horizontal): Usually the independent variable or categories


●​ Y-axis (vertical): Usually the dependent variable or measurements

When to Use It

Scatter plots: Comparing two quantitative variables

●​ Example: Height vs Weight, Study Hours vs Exam Scores

Bar charts: Comparing categories

●​ Example: Sales by product category, Population by country

Line charts: Showing change over time

●​ Example: Stock prices over months, Temperature throughout a day

Why It's Intuitive

For Quantitative Data: We naturally understand "higher = more" and "further right = more"

For Categorical Data: Easy to compare heights of bars or positions of points

Example: A bar chart of monthly revenue - each month is a position on the X-axis, revenue height on Y-axis.
You instantly see which months had higher sales.

Nonlinear Axes - For Special Data Patterns


The Problem with Linear Axes

Sometimes your data has huge ranges or exponential growth, making linear axes impractical.
Logarithmic Scales

When to use: Data that spans multiple orders of magnitude

Example 1 - Population Growth:

●​ Year 1900: 1 billion people


●​ Year 2000: 6 billion people
●​ Year 2024: 8 billion people

On a linear scale, the early growth looks flat, and all the action is squeezed at the top. On a logarithmic
scale, each step represents a multiplication (×10), so growth patterns become clear across the entire timeline.

Example 2 - Income Distribution:

●​ Most people earn $30,000-$100,000


●​ Some earn $1 million
●​ A few earn $100 million

A linear scale would waste 99% of the space on the ultra-wealthy, making the majority's differences invisible. A
log scale shows meaningful patterns across all income levels.
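A quick way to see what a log axis does is to compare raw values with their base-10 logarithms. The income figures above span more than three orders of magnitude, yet their logs sit close together:

```python
import math

# Incomes from the example above: a linear axis devotes almost all of
# its length to the $100M value, while log10 spaces the values evenly
# per factor of ten.
incomes = [30_000, 100_000, 1_000_000, 100_000_000]
for value in incomes:
    print(f"{value:>12,} -> log10 = {math.log10(value):.2f}")
```

On a log axis, each equal step of axis distance corresponds to one of these log units, i.e. one multiplication by 10.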

Richter Scale for Earthquakes

This is a famous real-world use of logarithmic scaling:

●​ Magnitude 4: Minor earthquake
●​ Magnitude 5: 10× greater ground motion than magnitude 4
●​ Magnitude 6: 100× greater than magnitude 4
●​ Magnitude 7: 1,000× greater than magnitude 4

Why it works: Earthquake amplitudes vary by factors of millions. A log scale compresses this huge range into a
readable one.

Benefit: log scales compress large ranges into readable visualizations - you can see patterns across vastly
different magnitudes.

Coordinate Systems with Curved Axes


Sometimes straight axes aren't the best choice for representing relationships in your data.

Polar Coordinates

Structure:

●​ Instead of X and Y, you have angle (direction) and radius (distance from center)
●​ Think of it like a dartboard or radar screen

When to use:

Pie Charts: The most common use

●​ Angle of each slice = proportion of total


●​ Example: Market share by company, Budget allocation by department
Radar/Spider Charts:

●​ Multiple variables arranged in a circle


●​ Example: Player stats in a video game (Speed, Strength, Intelligence, etc.)
●​ Each axis radiates from the center

When data is cyclical:

●​ Time of day: 00:00 connects back to 24:00 - it's a cycle


●​ Seasons: Winter → Spring → Summer → Fall → Winter (loops back)
●​ Wind direction: North, East, South, West, North (circular)
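The pie-chart case can be sketched directly: proportions become slice angles, and a renderer converts polar (angle, radius) pairs back to Cartesian coordinates to place labels. The company names and shares below are invented:

```python
import math

# Market shares -> pie-slice angles: the angle aesthetic encodes the
# proportion; the radius stays constant.
shares = {"Acme": 0.50, "Globex": 0.30, "Initech": 0.20}

slices = {}
start = 0.0
for company, share in shares.items():
    sweep = share * 360.0              # this slice's share of the full circle
    slices[company] = (start, start + sweep)
    start += sweep

for company, (a0, a1) in slices.items():
    print(f"{company}: {a0:.1f} deg to {a1:.1f} deg")

def polar_to_cartesian(angle_deg, radius):
    """Convert a polar (angle, radius) pair to (x, y), as a renderer
    does when positioning a slice's label."""
    rad = math.radians(angle_deg)
    return radius * math.cos(rad), radius * math.sin(rad)
```

The same angle arithmetic underlies radar charts (one axis per variable, spaced evenly around the circle) and clock-style plots of cyclical data.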

Geographic Coordinate Systems

What they are: Latitude and longitude mapping

Purpose: Representing spatial data on Earth's surface

Examples:

●​ Choropleth maps: Countries colored by GDP, disease prevalence, election results


●​ Point maps: Cities marked with dots sized by population
●​ Heat maps: Crime density across neighborhoods

Why curved axes: Earth is a sphere (roughly), so flat maps require projection systems. Different projections
preserve different properties:

●​ Mercator: Preserves angles (good for navigation)


●​ Equal-area: Preserves size (good for comparing land areas)

Effective for spatial relationships: When your data has a geographic component (where things happen
matters), these coordinate systems are essential.

Putting It All Together - A Complete Example


Let's visualize "Global CO₂ Emissions by Country Over Time":

Data Type:

●​ Country (categorical)
●​ Year (quantitative, temporal)
●​ CO₂ emissions (quantitative)

Aesthetic Mappings:

●​ Position (X-axis): Year (time series)


●​ Position (Y-axis): CO₂ emissions
●​ Color: Different colors for different countries
●​ Line: Connect points over time to show trends

Coordinate System: Cartesian (X-Y axes)

Scale Choice:
●​ Y-axis could be linear (if emissions are similar) or logarithmic (if some countries emit 100× more than
others)

Result: A multi-line chart where you can compare emissions trends across countries over time, seeing which
are increasing, decreasing, or stable.

Key Takeaways
1.​ Aesthetics are your visual vocabulary - position, color, size, shape, lines are how you "speak" data
to your audience
2.​ Match aesthetics to data types - don't use color gradients for categories or distinct colors for
continuous values
3.​ Scales translate data to visuals - they're the invisible machinery that makes your chart work
4.​ Coordinate systems are your framework - Cartesian for most uses, polar for cyclical data,
geographic for spatial data, logarithmic for huge ranges
5.​ Thoughtful choices prevent misinterpretation - the wrong aesthetic or scale can mislead, even
unintentionally
