Data Collection
Objectives
At the end of this chapter, students will be able to
• understand basic types of data used in data science
• know the general/popular methods of data collection
• create their own datasets using different techniques
Types of data
There are several types of data that are commonly used
in data science and/or machine learning, including:
• Numeric Data: This type of data includes numerical
values, such as height, weight, age, temperature, or
price.
• Categorical Data: Categorical data includes variables
that are not numerical in nature, such as gender, race,
or color. This data can be nominal (no natural ordering)
or ordinal (ordered).
• Text Data: Text data is any type of data that is in the
form of natural language text. It includes documents,
articles, social media posts, and more.
Nominal Data examples
Nominal data is qualitative data that is categorized into
groups without any particular order or ranking.
• Gender (male, female, non-binary)
• Ethnicity (Asian, Black, Hispanic, White, etc.)
• Hair color (blonde, brown, black, red)
• Marital status (single, married, divorced, widowed)
• Favorite color (red, blue, green, etc.)
Ordinal Data examples
Ordinal data is data whose categories can be put in a meaningful
order or rank. However, the differences between adjacent categories
are not necessarily equal, so any numbers assigned to the categories
indicate order only, not magnitude.
• Education level (high school diploma, associate's degree,
bachelor's degree, master's degree, PhD)
• Income level (less than $25,000, $25,000 to $50,000, $50,000 to
$75,000, $75,000 to $100,000, more than $100,000)
• Customer satisfaction rating (very satisfied, somewhat satisfied,
neither satisfied nor dissatisfied, somewhat dissatisfied, very
dissatisfied)
• Ranking of sports teams (first, second, third, etc.)
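The difference between nominal and ordinal data can be sketched in a few lines of Python. This is a minimal illustration, assuming the education levels listed above; the `responses` list is made-up sample data, and the names `EDUCATION_ORDER` and `RANK` are introduced here only for the example.

```python
# Ordered categories, lowest to highest, taken from the education-level example.
EDUCATION_ORDER = [
    "high school diploma",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
    "PhD",
]

# Map each category to its rank. The rank encodes order only: the gap between
# ranks 0 and 1 is not claimed to equal the gap between ranks 3 and 4.
RANK = {level: position for position, level in enumerate(EDUCATION_ORDER)}

# Hypothetical survey responses.
responses = ["master's degree", "high school diploma", "PhD", "bachelor's degree"]

# Ordinal data supports sorting and min/max by rank.
sorted_responses = sorted(responses, key=RANK.__getitem__)
highest = max(responses, key=RANK.__getitem__)

print(sorted_responses)
print(highest)
```

Nominal data such as hair color has no equivalent of `EDUCATION_ORDER`, so sorting it would only produce an arbitrary (for example alphabetical) arrangement.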
Types of data (continued)
• Time Series Data: Time series data refers to data that is
collected over time, such as stock prices, weather data,
or sales data.
• Image Data: Image data is any type of data that is in
the form of images or pictures, such as photographs,
satellite images, or medical images.
• Audio Data: Audio data is any type of data that is in the
form of audio files, such as speech recordings, music
files, or sound effects.
• Video Data: Video data is any type of data that is in the
form of video files, such as movies, television shows, or
surveillance footage.
• Each type of data requires different techniques and
tools to analyze and process effectively.
Identify Data Types
1. Data set: "Height of students in a classroom"
• What is the height of each student in the classroom?
• What is the average height of the students?
• What is the tallest and shortest height among the
students?
Answer:
2. Data set: "Fruit preferences of a group of people"
• What is each person's favorite fruit?
• How many people prefer apples over oranges?
• How many people do not like any type of fruit?
Answer:
Identify Data Types
1. Data set: "Height of students in a classroom"
• What is the height of each student in the classroom?
• What is the average height of the students?
• What is the tallest and shortest height among the
students?
Answer: Numeric data.
2. Data set: "Fruit preferences of a group of people"
• What is each person's favorite fruit?
• How many people prefer apples over oranges?
• How many people do not like any type of fruit?
Answer: Categorical data.
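The questions in these two exercises hint at the data type: numeric data supports arithmetic (averages, minimum, maximum), while categorical data supports counting. A minimal sketch using only the Python standard library follows; the heights and fruit preferences are made-up sample values, and "none" is an assumed label for people who do not like any fruit.

```python
from collections import Counter
from statistics import mean

# Hypothetical numeric data: student heights in centimeters.
heights_cm = [150, 160, 170, 180]

average_height = mean(heights_cm)
tallest = max(heights_cm)
shortest = min(heights_cm)

# Hypothetical categorical data: each person's favorite fruit.
favorite_fruit = ["apple", "orange", "apple", "none", "banana", "apple"]

# Counting is the natural operation on categories; averaging them is meaningless.
fruit_counts = Counter(favorite_fruit)
prefer_apples = fruit_counts["apple"]
prefer_oranges = fruit_counts["orange"]
like_no_fruit = fruit_counts["none"]

print(average_height, tallest, shortest)
print(prefer_apples, prefer_oranges, like_no_fruit)
```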
Identify Data Types
3. Data set: "Rating of a movie on a scale of 1 to 5"
• What is the rating of the movie?
• How many people gave the movie a rating of 4 or higher?
• How does the rating of this movie compare to the rating of another movie?
• Answer:
4. Data set: "Social media posts about a particular topic"
• What are the key topics or themes mentioned in the posts?
• How many posts were made in total?
• What is the sentiment of the posts (positive, negative, neutral)?
• Answer:
Identify Data Types
3. Data set: "Rating of a movie on a scale of 1 to 5"
• What is the rating of the movie?
• How many people gave the movie a rating of 4 or higher?
• How does the rating of this movie compare to the rating of another movie?
• Answer: Ordinal data.
4. Data set: "Social media posts about a particular topic"
• What are the key topics or themes mentioned in the posts?
• How many posts were made in total?
• What is the sentiment of the posts (positive, negative, neutral)?
• Answer: Text data.
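The same kind of sketch works for exercises 3 and 4. Ordinal ratings can be counted and summarized with the median, which relies only on order. Text needs to be split into words before anything can be counted. The ratings, posts, and the small `STOPWORDS` set below are all made-up for illustration; real sentiment analysis would need a dedicated model or lexicon and is not attempted here.

```python
import re
from collections import Counter
from statistics import median

# Hypothetical ordinal data: movie ratings on a 1 to 5 scale.
ratings = [5, 4, 3, 4, 2, 5, 1]
high_ratings = sum(1 for rating in ratings if rating >= 4)
typical_rating = median(ratings)  # the median only depends on order

# Hypothetical text data: social media posts about one topic.
posts = [
    "Love the new camera on this phone",
    "Battery life on the new phone is terrible",
    "The camera is great but the battery is weak",
]

# A tiny, assumed stopword list; real projects use much larger ones.
STOPWORDS = {"the", "on", "this", "is", "but"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOPWORDS]


total_posts = len(posts)
word_counts = Counter(word for post in posts for word in tokenize(post))

print(high_ratings, typical_rating)
print(total_posts, word_counts.most_common(4))
```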
Sources
• Data can be obtained from various sources using different mechanisms and
techniques
• Publicly available data
• Third-party data providers
• Sensor data
• Customer data
• Proprietary data
Publicly available data
• There are many publicly available datasets that can be
used for data analytics. Some of the popular sources
include government websites, open data portals, and
academic repositories.
• Examples of such data include weather data, financial
data, and social media data.
• Data repositories such as Kaggle also provide datasets
for analysis across many categories.
• Many specialized repositories offer data for research
and analysis in particular domains.
Third-party data providers
• Some companies specialize in providing data for
analytics. These data providers can offer access to data
such as market data, consumer behavior data, and
demographic data.
• Some popular datasets are available for free; others
must be paid for.
Sensor data
• Sensors and IoT devices generate large amounts of
data, which can be used for analytics. Examples of
sensor data include environmental data, healthcare
data, and manufacturing data.
Customer data
• Businesses can collect customer data from various
sources such as CRM (Customer Relationship Management)
systems, social media, and website analytics. This
data can be used to understand customer behavior,
preferences, and trends.
Proprietary data
• This data is provided by clients who approach a data
science or analytics company for solutions related to
their businesses.
• It is usually shared under contract terms and is
typically confidential.
Techniques to obtain data
• Surveys and questionnaires
• Web scraping
• Experimentation
• Observational studies
Surveys and questionnaires
• Surveys and questionnaires are a common way of
collecting data directly from individuals. These can be
conducted online, over the phone, or in person.
• Getting people to respond to surveys carefully is
difficult, so many responses are given casually.
• Surveys are still the most popular way of collecting
data in qualitative analysis.
• Once collected, the responses are usually encoded as
numerical data for analysis.
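As an illustration of the last point, the sketch below turns free-form survey answers into numeric codes. It assumes the five-level satisfaction scale used in the ordinal data examples; the CSV content, the column names, and the choice to map blank answers to `None` are all assumptions made for this example.

```python
import csv
import io

# Hypothetical survey export; in practice this would be read from a file.
RAW_CSV = """respondent,satisfaction
1,very satisfied
2,somewhat dissatisfied
3,neither satisfied nor dissatisfied
4,Very Satisfied
5,
"""

# Numeric codes for the ordered answer categories (1 = lowest).
SCALE = {
    "very dissatisfied": 1,
    "somewhat dissatisfied": 2,
    "neither satisfied nor dissatisfied": 3,
    "somewhat satisfied": 4,
    "very satisfied": 5,
}


def encode(answer):
    """Return the numeric code for an answer, or None if blank or unknown."""
    return SCALE.get(answer.strip().lower())


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
encoded = [encode(row["satisfaction"]) for row in rows]

# Keep only usable answers for later analysis.
valid_codes = [code for code in encoded if code is not None]

print(encoded)
print(valid_codes)
```

Normalizing case and whitespace before the lookup keeps casually typed answers such as "Very Satisfied" from being lost, while blank answers stay visible as missing values instead of being silently coded.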
Web scraping
• Web scraping involves extracting data from websites
using automated tools. This technique can be used to
collect data such as product information, customer
reviews, and social media data.
• Some websites prohibit scraping in their terms of use,
and scraping them may have legal consequences, so check
a site's policy before you scrape data from it.
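One practical precaution, not covered in these slides, is to check a site's robots.txt file, which by convention tells automated clients which paths they should not fetch. The sketch below uses Python's standard `urllib.robotparser` module on a made-up robots.txt; the rules, the example.com URLs, and the `my-course-bot` user agent are hypothetical. Passing this check does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. For a real site, call
# parser.set_url("https://<site>/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_scrape(url, user_agent="my-course-bot"):
    """Return True if the robots.txt rules allow this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


allowed_public = may_scrape("https://example.com/products/42")
allowed_private = may_scrape("https://example.com/private/report")

print(allowed_public, allowed_private)
```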
Experimentation
• Experimentation involves collecting data through
controlled experiments. For example, businesses may run
A/B tests to evaluate the effectiveness of marketing
campaigns.
• In these techniques, you control the conditions under
which the system operates.
• As another example, a university that wants to measure
the effectiveness of online teaching can teach some
students offline and others online, then collect data
from both groups to analyze.
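The university example can be sketched as a simple two-group experiment: assign students to the online or offline group at random, then compare an outcome between the groups. Everything below is illustrative. The student names and pass/fail outcomes are made-up, and a real study would also need a statistical significance test before drawing conclusions.

```python
import random


def assign_groups(subjects, seed=0):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def pass_rate(outcomes):
    """Fraction of outcomes equal to 1 (passed)."""
    return sum(outcomes) / len(outcomes)


students = [f"student_{i}" for i in range(10)]
online_group, offline_group = assign_groups(students)

# Hypothetical results recorded after the course (1 = passed, 0 = failed).
online_passed = [1, 1, 0, 1, 0]
offline_passed = [1, 0, 0, 1, 0]

difference = pass_rate(online_passed) - pass_rate(offline_passed)

print(online_group)
print(offline_group)
print(round(difference, 2))
```

Random assignment is what distinguishes an experiment from an observational study: it prevents students from sorting themselves into groups in ways that could also affect the outcome.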
Observational studies
• Observational studies involve collecting data by
observing and recording behavior.
For example, a researcher may observe how customers
interact with a product in a store.
Web Scraping described
• It is an automated method of obtaining large amounts
of data from websites.
• Also called web data mining or web harvesting, it is
the process of building an agent that can automatically
extract, parse, download, and organize useful
information from the web.
Image source: https://towardsdatascience.com/forget-apis-do-python-scraping-using-beautiful-soup-import-data-file-from-the-web-part-2-27af5d666246
Implementing Web Scraping in Python with BeautifulSoup
• Steps involved in web scraping:
• 1. Send an HTTP request to the URL of the webpage you want to access. The server
responds to the request by returning the HTML content of the webpage. For this task,
we will use requests, a third-party HTTP library for Python.
• 2. Once we have the HTML content, the next task is to parse it. Because HTML is
nested, we cannot reliably extract data through simple string processing. We need a
parser that builds a nested (tree) structure from the HTML. Many HTML parser libraries
are available; here we use html5lib, which parses pages the same way a web browser
does.
• 3. Finally, we navigate and search the parse tree that was created, i.e. tree
traversal. For this task, we will use another third-party Python library,
Beautiful Soup.
Web scraping example - Wikipedia
• Step 1: Install the required third-party libraries: requests, html5lib,
and bs4 (BeautifulSoup)
• Step 2: Access the HTML content of the webpage
• Step 3: Parse the HTML content
• When soup.prettify() is printed, it gives a visual representation of the
parse tree created from the raw HTML content.
• Step 4: Search and navigate through the parse tree
• Now we would like to extract some useful data from the HTML content.
The soup object contains all the data in a nested structure, which can
be extracted programmatically.
• In our example:
• We are scraping a webpage that lists Asian countries, so we would like
to write a program that saves the relevant information about those countries.
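The parsing and navigation steps can be sketched as follows. This is not the code from the original slides. It assumes a small inline HTML table in place of the real page, and the table layout, the `countries` class name, and the helper names are hypothetical. It also uses Python's built-in "html.parser" instead of html5lib so that only bs4 needs to be installed. The `fetch_html` helper shows where the requests call would go, but it is not invoked here.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by the server.
SAMPLE_HTML = """
<html><body>
<table class="countries">
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
  <tr><td>India</td><td>New Delhi</td></tr>
  <tr><td>Nepal</td><td>Kathmandu</td></tr>
</table>
</body></html>
"""


def fetch_html(url):
    """Step 2: download a page. Imported lazily so the rest runs offline."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_countries(html):
    """Steps 3 and 4: parse the HTML, then walk the table rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="countries")
    records = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) == 2:
            records.append({"country": cells[0], "capital": cells[1]})
    return records


countries = extract_countries(SAMPLE_HTML)
for record in countries:
    print(record["country"], "-", record["capital"])
```

For a live page, `extract_countries(fetch_html(url))` would combine all the steps, with the selectors adjusted to that page's actual structure.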
Web Scraping - Twitter
• Twitter, like many other social networking sites, provides APIs that help users build
software applications that integrate with its web services.
• Tweepy is an open source Python package that gives you a very convenient way to
access the Twitter API with Python.
• To access the Twitter RESTful API methods, you must have a Twitter Developer account. If
you don’t have one, you can apply for one at https://developer.twitter.com/en
Scraping Twitter Data using Tweepy
• When you get your own developer account, you will be given an access
token and an access token secret. You will also be given an API key and an
API secret (the consumer key and consumer secret) to authenticate yourself
to the Twitter API.
• (Note: Don’t share your credentials with anyone)
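One common way to follow this advice is to keep credentials out of the source file and read them from environment variables at run time. The sketch below is one possible approach, not part of Tweepy; the environment variable names and the `load_credentials` helper are hypothetical.

```python
import os

# Hypothetical environment variable names; choose any names you like.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or fail loudly if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be passed to the authentication calls in place of hard-coded strings, so the script can be shared without exposing the keys.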
Scraping Twitter Data using Tweepy
import tweepy
import pandas as pd

# Authenticate to the Twitter API using developer account credentials
consumer_key = "XXXXXXXX"  # your credentials
consumer_secret = "XXXXXXXX"
access_token = "XXXXXXXX"
access_token_secret = "XXXXXXXX"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Define the search query and specify the number of tweets to download
search_query = "iphone 14 pro"
max_tweets = 1000

# Define an empty list to store the tweets
tweets_list = []

# Download tweets using a Tweepy Cursor object
for tweet in tweepy.Cursor(api.search_tweets, q=search_query,
                           tweet_mode='extended').items(max_tweets):
    # Extract the relevant tweet information from different parts of a tweet
    tweet_text = tweet.full_text
    tweet_date = tweet.created_at
    tweet_user = tweet.user.screen_name
    tweet_favorites = tweet.favorite_count
    tweet_retweets = tweet.retweet_count

    # Append the tweet information to the tweets list (inside the loop)
    tweets_list.append([tweet_text, tweet_date,
                        tweet_user, tweet_favorites, tweet_retweets])

# Create a Pandas dataframe to store the tweets
tweets_df = pd.DataFrame(tweets_list, columns=['Text',
                         'Date', 'User', 'Favorites', 'Retweets'])

# Save the dataframe to a CSV file
tweets_df.to_csv('iphone14pro_tweets.csv', index=False)
Summary
• In conclusion, collecting data is an important step in
data analytics and data science. The sources of data
and techniques to obtain data are diverse and depend
on the specific problem being addressed. It is important
to ensure that the data collected is reliable, relevant,
and representative of the population of interest.
References
1. https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/?ref=rp
2. https://developer.twitter.com/en/docs/twitter-api/v1
3. https://developer.twitter.com/en