Chapter 3 - Data Collection 1

Uploaded by

bazeerahamed123

Data Collection

Objectives
At the end of this chapter, students will be able to

• understand basic types of data used in data science

• know the general/popular methods of data collection
• create their own datasets using different techniques
Types of data
There are several types of data that are commonly used
in data science and/or machine learning, including:
• Numeric Data: This type of data includes numerical
values, such as height, weight, age, temperature, or
price.
• Categorical Data: Categorical data includes variables
that are not numerical in nature, such as gender, race,
or color. This data can be nominal (no natural ordering)
or ordinal (ordered).
• Text Data: Text data is any type of data that is in the
form of natural language text. It includes documents,
articles, social media posts, and more.
Nominal Data examples
Nominal data is qualitative data that is categorized into
groups without any particular order or ranking.

• Gender (male, female, non-binary)

• Ethnicity (Asian, Black, Hispanic, White, etc.)
• Hair color (blonde, brown, black, red)
• Marital status (single, married, divorced, widowed)
• Favorite color (red, blue, green, etc.)
Ordinal Data examples
Ordinal data is qualitative or quantitative data whose categories can be
put in a meaningful order or rank. However, the differences between
neighboring categories are not necessarily equal, so any numbers
assigned to the categories indicate rank only, not magnitude.
• Education level (high school diploma, associate's degree,
bachelor's degree, master's degree, PhD)
• Income level (less than $25,000, $25,000 to $50,000, $50,000 to
$75,000, $75,000 to $100,000, more than $100,000)
• Customer satisfaction rating (very satisfied, somewhat satisfied,
neither satisfied nor dissatisfied, somewhat dissatisfied, very
dissatisfied)
• Ranking of sports teams (first, second, third, etc.)
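The difference between nominal and ordinal data can be sketched in a few lines of Python using only the standard library. The `EDUCATION_ORDER` list and the sample responses are made up for illustration; the point is that ordinal categories can be sorted by rank, while nominal categories can only be counted.

```python
from collections import Counter

# Ordinal: the categories have a meaningful order (lowest to highest).
EDUCATION_ORDER = [
    "high school diploma",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
    "PhD",
]


def rank(level):
    """Return the position of an education level in the agreed order."""
    return EDUCATION_ORDER.index(level)


# Hypothetical survey responses (not real data).
education = ["master's degree", "high school diploma", "PhD", "bachelor's degree"]
hair_color = ["brown", "black", "brown", "red"]

# Sorting by rank is meaningful for ordinal data.
education_sorted = sorted(education, key=rank)

# For nominal data there is no order, so we only count each category.
hair_counts = Counter(hair_color)

print(education_sorted)
print(hair_counts.most_common(1))
```

Note that the rank numbers say only which level comes first; the gap between rank 0 and rank 1 is not assumed to equal the gap between rank 3 and rank 4.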
Types of data
• Time Series Data: Time series data refers to data that is
collected over time, such as stock prices, weather data,
or sales data.
• Image Data: Image data is any type of data that is in
the form of images or pictures, such as photographs,
satellite images, or medical images.
• Audio Data: Audio data is any type of data that is in the
form of audio files, such as speech recordings, music
files, or sound effects.
Types of data
• Video Data: Video data is any type of data that is in the
form of video files, such as movies, television shows, or
surveillance footage.
• Each type of data requires different techniques and
tools to analyze and process effectively.
Identify Data Types
1. Data set: "Height of students in a classroom"
• What is the height of each student in the classroom?
• What is the average height of the students?
• What are the tallest and shortest heights among the
students?

Answer:

2. Data set: "Fruit preferences of a group of people"

• What is each person's favorite fruit?
• How many people prefer apples over oranges?
• How many people do not like any type of fruit?

Answer:
Identify Data Types
1. Data set: "Height of students in a classroom"
• What is the height of each student in the classroom?
• What is the average height of the students?
• What are the tallest and shortest heights among the
students?

Answer: Numeric

2. Data set: "Fruit preferences of a group of people"

• What is each person's favorite fruit?
• How many people prefer apples over oranges?
• How many people do not like any type of fruit?

Answer: Categorical
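The two data sets above call for different operations, which is one practical way to tell the types apart. The sketch below uses invented values (the heights and fruit answers are assumptions, not data from the chapter): numeric data supports averages and extremes, while categorical data supports counting per category.

```python
import statistics
from collections import Counter

# Hypothetical numeric data: student heights in centimetres.
heights_cm = [152.0, 160.5, 148.0, 171.5, 165.0]

average_height = statistics.mean(heights_cm)
tallest = max(heights_cm)
shortest = min(heights_cm)

# Hypothetical categorical data: favorite fruit per person.
# None stands for "does not like any type of fruit".
favorite_fruit = ["apple", "orange", "apple", None, "banana", "apple"]

fruit_counts = Counter(f for f in favorite_fruit if f is not None)
no_fruit = favorite_fruit.count(None)

print(average_height, tallest, shortest)
print(fruit_counts["apple"], fruit_counts["orange"], no_fruit)
```

Taking the average of the fruit answers would be meaningless, which is why identifying the data type comes before choosing an analysis.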
Identify Data Types
3. Data set: "Rating of a movie on a scale of 1 to 5"
• What is the rating of the movie?
• How many people gave the movie a rating of 4 or higher?
• How does the rating of this movie compare to the rating of another movie?
• Answer:

4. Data set: "Social media posts about a particular topic"

• What are the key topics or themes mentioned in the posts?
• How many posts were made in total?
• What is the sentiment of the posts (positive, negative, neutral)?
• Answer:
Identify Data Types
3. Data set: "Rating of a movie on a scale of 1 to 5"
• What is the rating of the movie?
• How many people gave the movie a rating of 4 or higher?
• How does the rating of this movie compare to the rating of another movie?
• Answer: Ordinal data.

4. Data set: "Social media posts about a particular topic"

• What are the key topics or themes mentioned in the posts?
• How many posts were made in total?
• What is the sentiment of the posts (positive, negative, neutral)?
• Answer: Text data.
Sources

• Data can be obtained from various sources and using different mechanisms /
techniques
• Publicly available data
• Third party data providers
• Sensor data
• Customer data
• Proprietary data
Publicly available data
• There are many publicly available datasets that can be
used for data analytics. Some of the popular sources
include government websites, open data portals, and
academic repositories.
• Examples of such data include weather data, financial
data, and social media data.
• There are also data repositories such as Kaggle, which
provide datasets for analysis in many different categories.
• Many specialized data repositories also make data
available for research and analysis.
Third-party data providers
• Some companies specialize in providing data for
analytics. These data providers can offer access to data
such as market data, consumer behavior data, and
demographic data.
• Some popular datasets are available for free; for
others you may have to pay.
Sensor data
• Sensors and IoT devices generate large amounts of
data, which can be used for analytics. Examples of
sensor data include environmental data, healthcare
data, and manufacturing data.
Customer data
• Businesses can collect customer data from various
sources such as customer relationship management (CRM)
systems, social media, and website analytics. This
data can be used to understand customer behavior,
preferences, and trends.
Proprietary data
• Proprietary data is provided by clients who approach a
data science or analytics company for solutions related
to their businesses.
• This data is mostly shared under contract terms and is
usually confidential.
Techniques to obtain data
• Surveys and questionnaires
• Web scraping
• Experimentation
• Observational studies
Surveys and questionnaires
• Surveys and questionnaires are a common way of
collecting data directly from individuals. These can be
conducted online, over the phone, or in person.
• It is difficult to get people to respond to surveys
seriously, so many responses are given casually.
• Surveys are still the most popular way of collecting data in
qualitative analysis.
• Once collected, the responses are usually coded as
numerical data for analysis.
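Turning survey answers into numbers can be sketched as a simple lookup. The coding scheme below (5 for the most satisfied answer down to 1) is an assumption chosen for illustration, and the raw answers are invented; real surveys should document whichever scheme they use.

```python
# Assumed coding scheme: 5 = most satisfied, 1 = least satisfied.
SATISFACTION_CODES = {
    "very satisfied": 5,
    "somewhat satisfied": 4,
    "neither satisfied nor dissatisfied": 3,
    "somewhat dissatisfied": 2,
    "very dissatisfied": 1,
}


def encode_responses(responses):
    """Map text answers to numeric codes; unrecognized answers become None."""
    return [SATISFACTION_CODES.get(r.strip().lower()) for r in responses]


# Hypothetical raw answers, including inconsistent case, stray spaces,
# and one casual answer that does not match any category.
raw = ["Very satisfied", "somewhat dissatisfied ", "no idea", "Somewhat satisfied"]
encoded = encode_responses(raw)
print(encoded)
```

Keeping unrecognized answers as `None` rather than guessing a code makes the casual or invalid responses easy to count and filter out later.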
Web scraping
• Web scraping involves extracting data from websites
using automated tools. This technique can be used to
collect data such as product information, customer
reviews, and social media data.
• Some websites prohibit scraping in their terms of use, and
scraping can have legal consequences, so check what a website
allows before you plan to scrape data from it.
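One common first check before scraping is the site's robots.txt file, which states which paths automated clients are asked to avoid. The sketch below uses Python's standard `urllib.robotparser` on an invented robots.txt (the rules and the example.com URLs are assumptions). Passing this check does not by itself make scraping permitted; a site's terms of use still apply.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used here instead of a network request.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/products/1"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

For a live site, `RobotFileParser` can also download the file itself through its `set_url()` and `read()` methods.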
Experimentation
• Experimentation involves collecting data through
controlled experiments. For example, businesses may run
A/B tests to evaluate the effectiveness of marketing
campaigns.
• In these techniques, you control the conditions under
which the system operates.
• As another example, a university that wants to measure
the effectiveness of online teaching can teach some
students offline and others online, and then collect data
from both groups to analyze.
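The university example can be sketched as random assignment followed by a comparison of group averages. Everything here is illustrative: the student IDs, the scores, and the helper names `assign_groups` and `mean_difference` are assumptions, and a real study would add a statistical test before drawing conclusions.

```python
import random
import statistics


def assign_groups(student_ids, seed=0):
    """Randomly split students into two groups of (nearly) equal size."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mean_difference(scores_a, scores_b):
    """Difference between the average scores of two groups."""
    return statistics.mean(scores_a) - statistics.mean(scores_b)


students = [f"S{i:02d}" for i in range(1, 11)]
online, offline = assign_groups(students)

# Hypothetical exam scores collected after the teaching period.
online_scores = [72, 68, 75, 80, 70]
offline_scores = [76, 71, 69, 78, 76]

diff = mean_difference(online_scores, offline_scores)
print(online, offline)
print(diff)
```

Random assignment is what makes this an experiment rather than an observation: it keeps the choice of group from depending on the students themselves.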
Observational studies
• Observational studies involve collecting data by
observing and recording behavior.
For example, a researcher may observe how customers
interact with a product in a store.
Web Scraping described
• Web scraping is an automated method of obtaining large amounts of
data from websites.
• Also called web data mining or web harvesting, it is the
process of building an agent that can automatically extract,
parse, download, and organize useful information from
the web.
Image source: https://towardsdatascience.com/forget-apis-do-python-scraping-using-beautiful-soup-import-data-file-from-the-web-part-2-27af5d666246
Implementing Web Scraping in
Python with BeautifulSoup
• Steps involved in web scraping:

• 1. Send an HTTP request to the URL of the webpage you want to access. The server
responds to the request by returning the HTML content of the webpage. For this task,
we will use requests, a third-party HTTP library for Python.

• 2. Once we have the HTML content, the next task is to parse it. Since most HTML
is nested, we cannot reliably extract data through simple string processing. We need
a parser that can build a nested (tree) structure from the HTML. Many HTML parser
libraries are available; this example uses html5lib, which parses pages the way a
web browser does.

• 3. Finally, we navigate and search the parse tree that was created,
i.e. tree traversal. For this task, we will use another third-party Python library,
Beautiful Soup.
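The three steps can be sketched with the Python standard library alone, so the example runs without installing anything. This is a stand-in, not the requests and Beautiful Soup code the chapter refers to: `urllib.request` replaces requests, and `html.parser.HTMLParser` reports tags as events instead of building a searchable tree. The names `fetch_html`, `LinkCollector`, and `extract_links` and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Step 1: send an HTTP GET request and return the page body as text."""
    request = Request(url, headers={"User-Agent": "data-collection-example"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkCollector(HTMLParser):
    """Steps 2 and 3: parse the HTML and collect the target of every link."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Feed HTML text to the parser and return the links it found."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Inline sample page, so the example does not need network access.
SAMPLE_HTML = (
    '<html><body><a href="/a">A</a>'
    '<p><a href="/b">B</a></p><a>no link</a></body></html>'
)
print(extract_links(SAMPLE_HTML))
```

For a real page, `extract_links(fetch_html(url))` chains the steps together. Beautiful Soup offers the same idea through tree searches, which is usually more convenient for deeply nested pages.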
Web scraping example - Wikipedia
• Step 1: Install the required third-party libraries: requests, html5lib,
and bs4 (BeautifulSoup)
• Step 2: Access the HTML content of the webpage
• Step 3: Parse the HTML content
When soup.prettify() is printed, it gives a visual representation of the parse tree
created from the raw HTML content.
• Step 4: Search and navigate through the parse tree
• Now, we would like to extract some useful data from the HTML content.
The soup object contains all the data in a nested structure, which can
be extracted programmatically.

• In our example:
• We are scraping a webpage that lists Asian countries, so we would like
to create a program that saves the relevant information about those countries.
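Saving table data from a page can be sketched as follows. The HTML table here is a small invented sample rather than the actual Wikipedia page, and the standard library's `HTMLParser` and `csv` modules are used in place of Beautiful Soup so the example is self-contained. `TableRows` and `table_to_csv` are hypothetical helper names.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse the table rows in html and return them as CSV text."""
    parser = TableRows()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Small sample table standing in for the scraped page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>India</td><td>New Delhi</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
"""

print(table_to_csv(SAMPLE_HTML))
```

To write a file instead of a string, the same `csv.writer` call can be pointed at a file opened with `open(path, "w", newline="")`.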
Web Scraping - Twitter

• Twitter, like many other social networking sites, provides APIs that help developers
build software applications that integrate with its web services.

• Tweepy is an open-source Python package that gives you a very convenient way to
access the Twitter API with Python.

• To access the Twitter RESTful API methods, you must have a Twitter Developer account. If
you don't have one, you can apply for one at https://developer.twitter.com/en
Scraping Twitter Data using Tweepy

• When you get your own developer account, you will be given an access
token and an access token secret. You will also be given an API key and an API
secret key to authenticate yourself when using the Twitter API.

• (Note: Don’t share your credentials with anyone)
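One way to follow the note above is to keep the keys out of the script entirely and read them from environment variables. This is a minimal sketch: the variable names such as `TWITTER_CONSUMER_KEY` and the helper `load_credentials` are hypothetical choices, not names required by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names; choose any names you like,
# as long as the script and your shell configuration agree.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Read the four credentials from the environment or fail with a clear error."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Because the values never appear in the source file, the script can be shared or committed to version control without exposing the keys.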

Scraping Twitter Data using Tweepy
import tweepy
import pandas as pd

# Authenticate to the Twitter API using developer account credentials
consumer_key = "XXXXXXXX"  # replace with your own credentials
consumer_secret = "XXXXXXXX"
access_token = "XXXXXXXX"
access_token_secret = "XXXXXXXX"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
Scraping Twitter Data using Tweepy
# Define the search query and specify the number of tweets to download
search_query = "iphone 14 pro"
max_tweets = 1000

# Define an empty list to store the tweets
tweets_list = []

# Download tweets using a Tweepy Cursor object
for tweet in tweepy.Cursor(api.search_tweets, q=search_query,
                           tweet_mode='extended').items(max_tweets):
    # Extract the relevant tweet information from different parts of a tweet
    tweet_text = tweet.full_text
    tweet_date = tweet.created_at
    tweet_user = tweet.user.screen_name
    tweet_favorites = tweet.favorite_count
    tweet_retweets = tweet.retweet_count

    # Append the tweet information to the tweets list
    tweets_list.append([tweet_text, tweet_date,
                        tweet_user, tweet_favorites, tweet_retweets])

# Create a Pandas dataframe to store the tweets
tweets_df = pd.DataFrame(tweets_list, columns=['Text',
                         'Date', 'User', 'Favorites', 'Retweets'])

# Save the dataframe to a CSV file
tweets_df.to_csv('iphone14pro_tweets.csv', index=False)
Summary
• In conclusion, collecting data is an important step in
data analytics and data science. The sources of data
and techniques to obtain data are diverse and depend
on the specific problem being addressed. It is important
to ensure that the data collected is reliable, relevant,
and representative of the population of interest.
References

1. https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/?ref=rp

2. https://developer.twitter.com/en/docs/twitter-api/v1

3. https://developer.twitter.com/en
