DVT Unit 2

The document discusses various data visualization techniques, including Word Clouds and Time Series Data, emphasizing their applications and limitations. It outlines the steps for Exploratory Data Analysis (EDA) and data collection methods, such as Single and Multiple Sources, along with web scraping for automated data collection. Additionally, it covers data cleaning, aggregation, and the importance of aesthetics in visual representation, highlighting how to effectively map data onto visual elements.


Word Clouds

Also known as a Tag Cloud.

A visualisation method that displays how frequently words appear in a given body of text, by making the size of
each word proportional to its frequency. All the words are then arranged in a cluster or cloud of words.
Alternatively, the words can also be arranged in any format: horizontal lines, columns or within a shape.

Word Clouds can also be used to display words that have meta-data assigned to them. For example, in a
Word Cloud of all the World's countries, the population could be assigned to each country's name to determine
its size.

Colour used on Word Clouds is usually meaningless and is primarily aesthetic, but it can be used to categorise
words or to display another data variable.
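The frequency-to-size mapping at the heart of a Word Cloud can be sketched in a few lines of Python. The sample text and the point-size range below are arbitrary illustrations:

```python
from collections import Counter

text = ("data helps decisions data drives insight "
        "insight guides decisions data")
freq = Counter(text.split())

# Map each word's frequency onto a font size between min_pt and max_pt,
# so the most frequent word gets the largest type.
min_pt, max_pt = 10, 48
lo, hi = min(freq.values()), max(freq.values())

def font_size(count):
    if hi == lo:                       # all words equally frequent
        return float(max_pt)
    return min_pt + (count - lo) * (max_pt - min_pt) / (hi - lo)

for word, count in freq.most_common():
    print(f"{word}: count={count}, size={font_size(count):.0f}pt")
```

A real layout engine would additionally pack the sized words into a cloud shape, but the size computation itself is just this linear mapping.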

Typically, Word Clouds are used on websites or blogs to depict keyword or tag usage. They can also
be used to compare two different bodies of text.

Although simple and easy to understand, Word Clouds have some major flaws:
-​ Long words are emphasised over short words.
-​ Words whose letters contain many ascenders and descenders may receive more attention.
-​ They're not great for analytical accuracy, so used more for aesthetic reasons instead.
Time Series Data
Time series data is information collected in sequence over time. It shows how things change at different points,
like stock prices every day or temperature every hour.

-​ It is used in industries such as finance, pharmaceuticals, social media, and research.


-​ Analyzing and visualizing this data helps us find trends, seasonal patterns, and behaviors.
-​ These insights support forecasting and guide better decision-making.
-​ The main goal is to study data in time order to extract meaningful patterns and predictions.

Exploratory Data Analysis (EDA)
Steps
1.​ Data understanding - identify the data types involved and why they matter
2.​ Data Collection
3.​ Data Cleaning - Handling Missing values and Inconsistencies
4.​ Visualization - univariate, bivariate, multivariate
5.​ Handling Outliers
6.​ Findings and insights

Types
1.​ Univariate
2.​ Bivariate
3.​ Multivariate
UNIT 2

DATA COLLECTION

1. Single Source Data Collection


What It Means
Single source means you're pulling all your data from one place - one database, one spreadsheet, one API, or
one system.
Example in Practice
Imagine you work for an e-commerce company and you need to analyze sales performance. You extract all
your data directly from the company's sales database - customer purchases, transaction amounts, dates,
product categories, etc. Everything comes from that one database.

Advantages - Why Use Single Source?


Easier Integration: Since everything comes from one place, you don't need to worry about combining different
data formats or structures. It's like using LEGO blocks from the same set - they all fit together perfectly.
Easier Cleaning: The data typically follows consistent formatting rules, has the same data types, and uses the
same conventions. For example, dates will all be in the same format (MM/DD/YYYY or DD/MM/YYYY),
currency will be consistent, etc.

Limitations - The Downside


Lack of Context: Your data might tell you WHAT happened but not WHY. For instance, your sales data shows a
spike in December, but without external data (like marketing campaign data or seasonal trends), you can't
explain why.
Lack of Completeness: You're limited to what that one source tracks. Your sales database might not include
customer sentiment, competitor pricing, economic indicators, or other factors that influence sales.

2. Multiple Sources Data Collection


What It Means
Multiple sources means you're combining data from different places to create a more complete picture. You're
essentially connecting different puzzle pieces from different boxes.
Example in Practice
An agricultural researcher wants to understand crop yields. They combine:

Weather data (temperature, rainfall, humidity) from meteorological services


Crop yield data (harvest quantities, quality grades) from farms
Soil data (pH levels, nutrients) from agricultural databases

By merging these, they can analyze how weather patterns affect crop production.

Advantages - The Power of Multiple Sources


Richer Insights: You can see relationships and patterns that wouldn't be visible with just one dataset. In the
agricultural example, you can correlate rainfall amounts with yield quality.
More Comprehensive: You get a 360-degree view of your subject. Instead of just knowing sales numbers, you
might combine them with customer reviews, social media sentiment, and competitor data.

Challenges - The Complications


Data Integration: Different sources use different formats. One dataset might have dates as "01/15/2024" while
another uses "2024-01-15" or "January 15, 2024". You need to standardize everything.
Matching Formats: Beyond dates, you might have different units (kilometers vs. miles), different naming
conventions (USA vs. United States), or different levels of granularity (daily vs. monthly data).
Handling Inconsistencies: Sources might contradict each other, have different update frequencies, or measure
things differently. For example, one source might count "active users" as anyone who logged in that month,
while another counts only those who made a purchase.
Web Scraping - Automated Data Collection
What Is Web Scraping?

Web scraping is like having a robot assistant that visits websites and copies information for you automatically.
Instead of manually copying product prices or article headlines, you write code that does it for you at scale.

Tools for Web Scraping

Python Libraries - The three most popular:

1.​ BeautifulSoup: Best for parsing HTML/XML. It's like a smart search tool that finds specific elements on
a webpage (like all product names or prices).
2.​ Selenium: Controls a real web browser automatically. It can click buttons, fill forms, and handle
websites that require user interaction. Think of it as a robot clicking through a website just like you
would.
3.​ Scrapy: A complete framework for large-scale scraping. It's like BeautifulSoup on steroids - designed
for crawling entire websites efficiently.
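BeautifulSoup is the usual choice for this kind of parsing. As a dependency-free sketch of the same idea, Python's built-in html.parser can pull specific elements out of a page; the HTML snippet and class names below are made up:

```python
from html.parser import HTMLParser

# Hypothetical product-listing snippet; in real scraping this HTML
# would come from an HTTP response.
HTML = """
<ul>
  <li class="product"><span class="name">Laptop</span><span class="price">$999</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">$25</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collects the text of every element whose class is 'price'."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data)
            self.in_price = False

parser = PriceParser()
parser.feed(HTML)
print(parser.prices)  # ['$999', '$25']
```

BeautifulSoup does the same job with far less boilerplate (e.g. selecting all `.price` elements in one call), which is why it is the standard tool.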

Real-World Applications

E-commerce Price Monitoring: Scraping Amazon, Flipkart, or other sites to track product prices over time.
You could visualize price trends, find the best time to buy, or compare competitor pricing.

News Aggregation: Extracting articles from multiple news portals to analyze trending topics, sentiment, or
media coverage patterns.

Review Analysis: Gathering customer reviews and ratings from multiple platforms to understand product
sentiment, identify common complaints, or track brand reputation.

Important Considerations - The Rules of the Game

Ethical/Legal Boundaries: Not all websites allow scraping. Some explicitly prohibit it in their Terms of Service.
Always check before scraping. Just because you can access data doesn't mean you should.

robots.txt: This is a file websites use to tell crawlers which parts of the site they may access. For example,
amazon.com/robots.txt lists what Amazon allows. Respecting it is the ethical choice and reduces legal risk.
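Python's standard library can check robots.txt rules via urllib.robotparser. The rules below are an invented example, parsed directly from text so no network request is needed:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, not any real site's policy.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Check paths before scraping them.
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
print(rp.can_fetch("*", "https://example.com/products"))      # True
```

In a real scraper you would call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the live file, then gate every request on `can_fetch`.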

Server Load: If you send too many requests too quickly, you can overwhelm a website's servers (similar to a
DDoS attack). Always add delays between requests and be respectful of the website's resources.

Dynamic Content Handling: Modern websites load content with JavaScript. If you scrape the initial HTML,
you might get an empty page. This is where tools like Selenium become necessary - they wait for the
JavaScript to execute and content to load.
Data Cleaning - Removing Noise and Errors
1. Handling Missing Values

The Problem: Real-world data often has gaps. A survey respondent skips a question, a sensor fails, or a
database field is left blank.

Two Main Approaches:

Drop: Simply remove rows or columns with missing data. Use this when you have plenty of data and the
missing values are random.

●​ Example: If only 5 out of 10,000 customer records lack email addresses, just drop those 5.

Impute: Fill in missing values intelligently.

●​ Mean/Median imputation: Replace missing ages with the average age


●​ Forward fill: Use the last known value (good for time series)
●​ Predictive imputation: Use other columns to predict the missing value

Example: If you're visualizing temperature trends and one day's reading is missing, you might impute it as the
average of the day before and after.
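The temperature example above can be sketched as a small imputation helper in pure Python (no pandas; the readings are invented):

```python
def impute_neighbors(readings):
    """Fill each None with the average of the nearest known values on
    each side; at the ends, use the single available neighbour."""
    filled = list(readings)
    for i, v in enumerate(filled):
        if v is None:
            prev = next((filled[j] for j in range(i - 1, -1, -1)
                         if filled[j] is not None), None)
            nxt = next((filled[j] for j in range(i + 1, len(filled))
                        if filled[j] is not None), None)
            candidates = [x for x in (prev, nxt) if x is not None]
            filled[i] = sum(candidates) / len(candidates)
    return filled

temps = [21.0, 22.0, None, 24.0, None]
print(impute_neighbors(temps))  # [21.0, 22.0, 23.0, 24.0, 24.0]
```

This is the time-series flavour of imputation; for non-sequential data, replacing with the column mean or median is the usual equivalent.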

2. Removing Duplicates

The Problem: The same record appears multiple times, which skews your analysis and visualizations.

Example: A customer accidentally submitted the same order twice, or data was imported multiple times. If you
visualize total sales without removing duplicates, your revenue chart will be inflated.

Solution: Identify duplicate rows based on key fields (like customer ID + timestamp) and keep only one copy.
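A minimal sketch of key-based de-duplication, assuming records are dictionaries and keeping the first copy of each (customer ID, timestamp) pair:

```python
def drop_duplicates(records, key_fields):
    """Keep only the first record for each unique key combination."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

orders = [
    {"customer_id": 7, "timestamp": "2024-01-15T10:00", "amount": 50},
    {"customer_id": 7, "timestamp": "2024-01-15T10:00", "amount": 50},  # duplicate
    {"customer_id": 8, "timestamp": "2024-01-15T10:05", "amount": 30},
]
clean = drop_duplicates(orders, ["customer_id", "timestamp"])
print(len(clean))  # 2
```

Choosing the key fields is the important decision: too few fields and you drop legitimate repeat orders; too many and true duplicates slip through.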

3. Correcting Inconsistent Formats

The Problem: The same information is represented differently, making it impossible to group or compare.

Common Examples:

Dates:

●​ "01/15/2024" vs "2024-01-15" vs "15-Jan-24" vs "January 15, 2024"


●​ Solution: Standardize to one format, like ISO 8601 (YYYY-MM-DD)

Currency:

●​ "$1,500.00" vs "1500 USD" vs "₹1,12,500"


●​ Solution: Convert to numeric values with a consistent currency

Units:

●​ "5 km" vs "5000 m" vs "3.1 miles"


●​ Solution: Convert all to the same unit

Why It Matters for Visualization: If you're creating a timeline chart and your dates are in different formats, the
system won't recognize them as dates and your chart will break.
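The date case above can be sketched as a try-each-format converter that emits ISO 8601. The format list is illustrative, and note one caveat the code cannot solve: a date like 05/04/2024 is ambiguous between MM/DD and DD/MM unless you know the source's convention.

```python
from datetime import datetime

# Candidate formats seen in the text above; extend as new sources appear.
KNOWN_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d-%b-%y", "%B %d, %Y"]

def to_iso(date_str):
    """Convert a date in any known format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"Unrecognised date format: {date_str!r}")

for raw in ["01/15/2024", "2024-01-15", "15-Jan-24", "January 15, 2024"]:
    print(to_iso(raw))  # each prints 2024-01-15
```

The same try-parse-normalise pattern works for currencies and units: strip symbols, parse the number, convert to the canonical unit.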
Data Aggregation - Combining Data at Different Levels
What Is Aggregation?

Aggregation means summarizing detailed data into higher-level summaries. It's like zooming out to see the
bigger picture instead of individual details.

1. Summarization - Rolling Up Data

Example: Daily to Monthly Sales

●​ Raw data: Sales for every single day (Jan 1: $500, Jan 2: $750, Jan 3: $600...)
●​ Aggregated: Total sales for January: $45,000

Why Do This?

●​ Reduces Complexity: A chart with 365 daily points is cluttered; 12 monthly points is clearer
●​ Makes Patterns Visible: Daily fluctuations might hide the overall upward trend that becomes obvious
monthly
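Rolling daily figures up to monthly totals can be sketched with a plain dictionary; the sales figures reuse the example above:

```python
from collections import defaultdict

daily_sales = [
    ("2024-01-01", 500), ("2024-01-02", 750), ("2024-01-03", 600),
    ("2024-02-01", 900), ("2024-02-02", 400),
]

monthly = defaultdict(int)
for date, amount in daily_sales:
    monthly[date[:7]] += amount       # "YYYY-MM" prefix is the month key

print(dict(monthly))  # {'2024-01': 1850, '2024-02': 1300}
```

With pandas the same roll-up is a one-liner (`df.resample("ME").sum()` on a datetime index), but the grouping logic is identical.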

2. Grouping - Comparing Categories

Example: Average Test Scores per Department

●​ Raw data: Individual student scores (Student A: 85, Student B: 92, Student C: 78...)
●​ Grouped: Computer Science avg: 87, Mathematics avg: 82, Physics avg: 90

Why Do This? Enables comparison between categories. You can create a bar chart comparing departments,
which would be impossible with individual student data.

Benefits of Aggregation

Reduces Complexity: Instead of visualizing 1 million individual transactions, you show monthly totals - much
more digestible.

Makes Patterns Visible: Noise in individual data points can obscure trends. Aggregation smooths out random
variations and reveals the underlying patterns.

Performance: Visualizing aggregated data is faster and uses less memory than raw, detailed data.

The Pre-processing Pipeline


Here's how these steps typically flow:

1.​ Collect raw data (via web scraping or other methods)


2.​ Clean it (handle missing values, remove duplicates, fix formats)
3.​ Aggregate it (summarize or group as needed)
4.​ Visualize it (now your data is ready for charts and graphs!)
Practical Example - Putting It All Together
Let's say you scraped product reviews from an e-commerce site:

Raw Data Issues:

●​ Some reviews missing star ratings


●​ Duplicate reviews from bots
●​ Dates in different formats
●​ Prices in different currencies

Pre-processing Steps:

1.​ Impute missing ratings with the product average


2.​ Remove duplicate reviews based on user ID + timestamp
3.​ Standardize all dates to YYYY-MM-DD
4.​ Convert all prices to USD
5.​ Aggregate reviews by month to see sentiment trends

Now you can visualize: A line chart showing average monthly ratings over time, revealing that ratings
dropped after a price increase.

3. Mapping Data onto Aesthetics


The Core Concept

"Aesthetics" in data visualization doesn't mean "making things pretty" - it refers to visual properties that your
eyes can perceive and your brain can interpret. Think of aesthetics as the visual language you use to
communicate your data.

The Basic Idea: You're translating numbers and categories into visual elements that humans can understand
at a glance.

Aesthetics and Types of Data - The Right Tool for the Right Job
Let me explain each aesthetic and when to use it:

1. Position - The Most Powerful Aesthetic

Best for: Quantitative and ordinal data

Why it works: Our brains are extremely good at judging position along a scale. We can easily compare "this
point is higher than that point."

Examples:

●​ Scatter plot: X-axis = study hours, Y-axis = exam scores. Position tells you both values
simultaneously.
●​ Bar chart: Height of bars represents sales revenue
●​ Line chart: Position over time shows stock prices

Data Types:

●​ Quantitative: Temperature (0°C, 15°C, 30°C) - actual measurable values


●​ Ordinal: Education level (High School < Bachelor's < Master's < PhD) - ordered categories

2. Color - The Versatile Communicator

Two Different Uses:

For Categorical Data (Nominal/Ordinal): Different hues for different categories

●​ Example: Red = Product A, Blue = Product B, Green = Product C


●​ Each product line gets its own color in a multi-line chart
●​ Political maps: Red states vs Blue states

For Continuous Data: Gradients (color intensity)

●​ Example: Heat map showing temperature - light yellow (cool) → orange → dark red (hot)
●​ Darker/more intense colors = higher values
●​ Like a weather map showing temperature gradients

Pro Tip: Use color thoughtfully - too many colors can be overwhelming, and about 8% of men have color
blindness.

3. Size - Showing Magnitude

Best for: Quantitative data (showing "how much")

Examples:

●​ Bubble chart: Bubble size represents population or market size


●​ Each country is a bubble; larger bubbles = larger populations
●​ Scatter plot with sized points: Size shows sales volume while position shows price and profit

Why it works: Bigger = more is intuitive to everyone

Caution: Our brains judge area, not diameter. A circle with 2× the diameter has 4× the area, which can be
misleading if bubble sizes are not scaled carefully.

4. Shape - Distinguishing Categories

Best for: Nominal data (categories with no inherent order)

Examples:

●​ Scatter plot: Circles = Male customers, Triangles = Female customers, Squares = Non-binary
●​ Line chart: Solid line = Actual sales, Dashed line = Forecasted sales, Dotted line = Target

Limitation: Humans can only distinguish about 5-6 different shapes reliably. Beyond that, it becomes
confusing.

Best Practice: Often combined with color for redundancy (circle + blue, triangle + red), which helps colorblind
users.
5. Line/Area - Showing Change and Accumulation

Best for:

●​ Time series (changes over time)


●​ Aggregation (showing totals or cumulative values)

Line Examples:

●​ Stock price over time


●​ Temperature throughout the day
●​ Website traffic month by month

Area Examples:

●​ Stacked area chart: Show how different product categories contribute to total revenue over time
●​ Area under curve: Emphasizes the magnitude of change, not just the trend

Why lines for time: Our eyes naturally follow lines to see progression and trends. The slope tells us the rate
of change.

Scales - The Translation Mechanism


What Are Scales?

A scale is the bridge between your data values and visual properties. It's the function that converts "45
degrees Celsius" into "this specific position on the chart" or "Category A" into "blue color."

Types of Scales

Continuous Scale: Maps numerical ranges to visual ranges

●​ Example: Temperature data from 0°C to 100°C maps to a color gradient from blue (cold) to red (hot)
●​ Income from $0 to $200,000 maps to bar heights from 0 pixels to 400 pixels

Categorical Scale: Assigns distinct visual properties to distinct categories

●​ Example: Product categories (Electronics, Clothing, Food) map to distinct colors (Blue, Green,
Orange)
●​ Countries map to distinct shapes on a scatter plot
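A continuous scale is just a function from the data domain to the visual range. A minimal linear version, using the income-to-pixels mapping from the example above:

```python
def linear_scale(value, domain, range_):
    """Map a data value from its domain to a visual range,
    e.g. dollars to pixel heights."""
    d0, d1 = domain
    r0, r1 = range_
    return r0 + (value - d0) * (r1 - r0) / (d1 - d0)

# Income $0-$200,000 maps to bar heights 0-400 pixels.
print(linear_scale(50_000, (0, 200_000), (0, 400)))   # 100.0
print(linear_scale(200_000, (0, 200_000), (0, 400)))  # 400.0
```

Charting libraries implement exactly this idea (plus categorical and logarithmic variants); D3, for instance, calls them `scaleLinear`, `scaleOrdinal`, and `scaleLog`.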

Why Scales Matter

Example of Good Scale Use: Imagine visualizing global temperatures. If you use a continuous color scale
from light blue (cold) through white (moderate) to dark red (hot), viewers immediately understand the pattern.

Example of Bad Scale Use: If you used random colors (purple = cold, yellow = medium, green = hot), it would
be confusing because these colors don't have intuitive temperature associations.

Avoiding Misinterpretation:

●​ If your data ranges from 95 to 100, but your axis starts at 0, the differences might look tiny
●​ If your axis starts at 95, the same differences look dramatic
●​ Both are technically correct, but tell very different stories - your scale choice matters ethically

4. Coordinate Systems and Axes


What Are Coordinate Systems?

Think of coordinate systems as the graph paper on which you draw your data. They provide the framework
that gives context to positions.

Cartesian Coordinates - The Default Choice


What It Is

The familiar X and Y axes at right angles - the graph system you learned in school.

Structure:

●​ X-axis (horizontal): Usually the independent variable or categories


●​ Y-axis (vertical): Usually the dependent variable or measurements

When to Use It

Scatter plots: Comparing two quantitative variables

●​ Example: Height vs Weight, Study Hours vs Exam Scores

Bar charts: Comparing categories

●​ Example: Sales by product category, Population by country

Line charts: Showing change over time

●​ Example: Stock prices over months, Temperature throughout a day

Why It's Intuitive

For Quantitative Data: We naturally understand "higher = more" and "further right = more"

For Categorical Data: Easy to compare heights of bars or positions of points

Example: A bar chart of monthly revenue - each month is a position on the X-axis, revenue height on Y-axis.
You instantly see which months had higher sales.

Nonlinear Axes - For Special Data Patterns


The Problem with Linear Axes

Sometimes your data has huge ranges or exponential growth, making linear axes impractical.
Logarithmic Scales

When to use: Data that spans multiple orders of magnitude

Example 1 - Population Growth:

●​ Year 1900: 1 billion people


●​ Year 2000: 6 billion people
●​ Year 2024: 8 billion people

On a linear scale, the early growth looks flat, and all the action is squeezed at the top. On a logarithmic
scale, each step represents a multiplication (×10), so growth patterns become clear across the entire timeline.

Example 2 - Income Distribution:

●​ Most people earn $30,000-$100,000


●​ Some earn $1 million
●​ A few earn $100 million

A linear scale would waste 99% of the space on the ultra-wealthy, making the majority's differences invisible. A
log scale shows meaningful patterns across all income levels.
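A quick way to see what a log axis does is to compare raw values with their base-10 logarithms. The income figures above span more than three orders of magnitude, yet their logs sit close together:

```python
import math

# Incomes from the example above: a linear axis devotes almost all of
# its length to the $100M value, while log10 spaces the values evenly
# per factor of ten.
incomes = [30_000, 100_000, 1_000_000, 100_000_000]
for value in incomes:
    print(f"{value:>12,} -> log10 = {math.log10(value):.2f}")
```

On a log axis, each equal step of axis distance corresponds to one of these log units, i.e. one multiplication by 10.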

Richter Scale for Earthquakes

This is a famous real-world use of logarithmic scaling:

●​ Magnitude 4: Minor earthquake
●​ Magnitude 5: 10× greater ground motion than magnitude 4
●​ Magnitude 6: 100× greater than magnitude 4
●​ Magnitude 7: 1,000× greater than magnitude 4

Why it works: Earthquake amplitudes vary by factors of millions. A log scale compresses this huge range into a
readable one.

Benefit: log scales compress large ranges into readable visualizations - you can see patterns across vastly
different magnitudes.

Coordinate Systems with Curved Axes


Sometimes straight axes aren't the best choice for representing relationships in your data.

Polar Coordinates

Structure:

●​ Instead of X and Y, you have angle (direction) and radius (distance from center)
●​ Think of it like a dartboard or radar screen

When to use:

Pie Charts: The most common use

●​ Angle of each slice = proportion of total


●​ Example: Market share by company, Budget allocation by department
Radar/Spider Charts:

●​ Multiple variables arranged in a circle


●​ Example: Player stats in a video game (Speed, Strength, Intelligence, etc.)
●​ Each axis radiates from the center

When data is cyclical:

●​ Time of day: 00:00 connects back to 24:00 - it's a cycle


●​ Seasons: Winter → Spring → Summer → Fall → Winter (loops back)
●​ Wind direction: North, East, South, West, North (circular)
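The pie-chart case can be sketched directly: proportions become slice angles, and a renderer converts polar (angle, radius) pairs back to Cartesian coordinates to place labels. The company names and shares below are invented:

```python
import math

# Market shares -> pie-slice angles: the angle aesthetic encodes the
# proportion; the radius stays constant.
shares = {"Acme": 0.50, "Globex": 0.30, "Initech": 0.20}

slices = {}
start = 0.0
for company, share in shares.items():
    sweep = share * 360.0              # this slice's share of the full circle
    slices[company] = (start, start + sweep)
    start += sweep

for company, (a0, a1) in slices.items():
    print(f"{company}: {a0:.1f} deg to {a1:.1f} deg")

def polar_to_cartesian(angle_deg, radius):
    """Convert a polar (angle, radius) pair to (x, y), as a renderer
    does when positioning a slice's label."""
    rad = math.radians(angle_deg)
    return radius * math.cos(rad), radius * math.sin(rad)
```

The same angle arithmetic underlies radar charts (one axis per variable, spaced evenly around the circle) and clock-style plots of cyclical data.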

Geographic Coordinate Systems

What they are: Latitude and longitude mapping

Purpose: Representing spatial data on Earth's surface

Examples:

●​ Choropleth maps: Countries colored by GDP, disease prevalence, election results


●​ Point maps: Cities marked with dots sized by population
●​ Heat maps: Crime density across neighborhoods

Why curved axes: Earth is a sphere (roughly), so flat maps require projection systems. Different projections
preserve different properties:

●​ Mercator: Preserves angles (good for navigation)


●​ Equal-area: Preserves size (good for comparing land areas)

Effective for spatial relationships: When your data has a geographic component (where things happen
matters), these coordinate systems are essential.

Putting It All Together - A Complete Example


Let's visualize "Global CO₂ Emissions by Country Over Time":

Data Type:

●​ Country (categorical)
●​ Year (quantitative, temporal)
●​ CO₂ emissions (quantitative)

Aesthetic Mappings:

●​ Position (X-axis): Year (time series)


●​ Position (Y-axis): CO₂ emissions
●​ Color: Different colors for different countries
●​ Line: Connect points over time to show trends

Coordinate System: Cartesian (X-Y axes)

Scale Choice:
●​ Y-axis could be linear (if emissions are similar) or logarithmic (if some countries emit 100× more than
others)

Result: A multi-line chart where you can compare emissions trends across countries over time, seeing which
are increasing, decreasing, or stable.

Key Takeaways
1.​ Aesthetics are your visual vocabulary - position, color, size, shape, lines are how you "speak" data
to your audience
2.​ Match aesthetics to data types - don't use color gradients for categories or distinct colors for
continuous values
3.​ Scales translate data to visuals - they're the invisible machinery that makes your chart work
4.​ Coordinate systems are your framework - Cartesian for most uses, polar for cyclical data,
geographic for spatial data, logarithmic for huge ranges
5.​ Thoughtful choices prevent misinterpretation - the wrong aesthetic or scale can mislead, even
unintentionally
