UNIT-II: Data Collection and Management - Introduction, Sources of data, Data collection and
APIs, Exploring and fixing data, Data storage and management, Using multiple data sources
INTRODUCTION
Data collection and management are two critical, interconnected processes for any organization
or research project. Data collection is the systematic process of gathering information from
various sources to answer research questions, make informed decisions, and evaluate
outcomes. Data management is the subsequent practice of organizing, protecting, and storing
that data so it can be easily accessed, used, and analyzed.
SOURCES OF DATA
Data collection is the process of acquiring, extracting, and storing a voluminous amount of
data, which may be in structured or unstructured form such as text, video, audio, XML files,
records, or image files, and which is used in the later stages of data analysis. In big data
analysis, data collection is the initial step, carried out before analyzing the data for patterns
or useful information. The data to be analyzed must be collected from valid sources.
The data that is collected is known as raw data. Raw data is not useful on its own; after
cleaning out the impurities and analyzing it, it becomes information, and the insight derived
from that information is known as knowledge. Knowledge can take many forms, such as business
knowledge about the sales of enterprise products, knowledge about disease treatment, etc. The
main goal of data collection is to collect information-rich data. Data collection starts with
asking questions such as what type of data is to be collected and what the source of collection
is. Most collected data falls into two types: qualitative data, which is non-numerical data
such as words and sentences and mostly focuses on the behavior and actions of a group, and
quantitative data, which is in numerical form and can be analyzed using scientific tools and
sampling methods.
The actual data is then further divided mainly into two types known as:
Primary data
Secondary data
Primary data
Data that is raw, original, and extracted directly from official sources is known as primary
data. This type of data is collected directly through techniques such as questionnaires,
interviews, and surveys. The data collected must match the demands and requirements of the
target audience on which the analysis is performed; otherwise it becomes a burden in data
processing. A few methods of collecting primary data:
1. Interview method:
In this method, data is collected by interviewing the target audience: the person who conducts
the interview is called the interviewer and the person who answers is the interviewee. Basic
business or product related questions are asked and the answers are noted down in the form of
notes, audio, or video, and this data is stored for processing. Interviews can be structured or
unstructured, and may be personal or formal interviews conducted by telephone, face to face,
email, etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the
answers are noted down in the form of text, audio, or video. Surveys can be conducted in both
online and offline modes, for example through website forms and email, and the answers are then
stored for data analysis. Examples are online surveys and surveys through social media polls.
3. Observation method:
The observation method is a data collection method in which the researcher keenly observes the
behavior and practices of the target audience using a data collection tool and stores the
observed data in the form of text, audio, video, or other raw formats. In this method, the data
is collected through direct observation rather than by posing questions to the participants.
For example, observing a group of customers and their behavior towards a product. The data
obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data by performing experiments, research,
and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
CRD - Completely Randomized Design is a simple experimental design used in data
analytics which is based on randomization and replication. It is mostly used for
comparing experiments.
RBD - Randomized Block Design is an experimental design in which the experiment
is divided into small units called blocks. Random experiments are performed on each
of the blocks and results are drawn using a technique known as analysis of variance
(ANOVA). RBD originated in the agriculture sector.
LSD - Latin Square Design is an experimental design that is similar to CRD and RBD
but is arranged in rows and columns. It is an NxN arrangement with an equal number of
rows and columns in which each letter occurs exactly once in each row and each column,
so differences can be found easily and with fewer errors in the experiment. A Sudoku
puzzle is an example of a Latin square design (see the short sketch after this list).
FD - Factorial Design is an experimental design in which each experiment involves two
or more factors, each with several possible values (levels), and trials are performed
on the combinations of these factor levels.
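As an illustration of the Latin square idea above, the short Python sketch below builds a simple
NxN Latin square by cyclically shifting a row of symbols and checks the defining property that
every symbol occurs exactly once in each row and each column. This is a generic illustration,
not a specific statistical package; the symbols used are arbitrary.

```python
# Build a simple N x N Latin square by cyclic rotation and verify its defining
# property: every symbol appears exactly once in each row and each column.

def latin_square(symbols):
    n = len(symbols)
    # Row i is the symbol list rotated left by i positions.
    return [[symbols[(i + j) % n] for j in range(n)] for i in range(n)]

def is_latin_square(square):
    n = len(square)
    expected = set(square[0])
    rows_ok = all(set(row) == expected for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == expected for j in range(n))
    return rows_ok and cols_ok

if __name__ == "__main__":
    square = latin_square(["A", "B", "C", "D"])
    for row in square:
        print(" ".join(row))
    print("Valid Latin square:", is_latin_square(square))
```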
Secondary data
Secondary data is data which has already been collected and is reused for some valid purpose.
This type of data is derived from previously recorded primary data, and it comes from two types
of sources: internal sources and external sources.
1. Internal source:
These types of data can easily be found within the organization, such as market records, sales
records, transactions, customer data, accounting resources, etc. The cost and time required to
obtain data from internal sources are comparatively low.
2. External source:
Data which cannot be found within the organization and is obtained through external third-party
resources is external source data. The cost and time required are higher because external
sources contain a huge amount of data. Examples of external sources are government
publications, news publications, the Registrar General of India, the Planning Commission, the
International Labour Bureau, syndicate services, and other non-governmental publications.
3. Other sources:
Sensor data: With the advancement of IoT devices, the sensors in these devices
collect data which can be used for sensor data analytics to track the performance and
usage of products.
Satellite data: Satellites collect terabytes of images and data on a daily basis
through surveillance cameras, which can be used to extract useful information.
Web traffic: Due to fast and cheap internet facilities, data in many formats uploaded
by users on different platforms can be collected, with their permission, for data
analysis. Search engines also provide data about the keywords and queries that are
searched most often.
INTRODUCTION TO DATA COLLECTION AND APIS
Data collection is the process of gathering and measuring information from various sources to
gain insights and make informed decisions. In the modern digital world, a significant amount
of this data is dynamic and constantly updated, making manual collection inefficient. This is
where APIs (Application Programming Interfaces) become a powerful tool.
An API is a set of rules and protocols that allows different software applications to
communicate with each other. In the context of data collection, an API acts as a "contract" or a
"menu" that a data provider (the server) offers to a data consumer (the client). The client sends
a request according to the API's rules, and the server responds with the requested data.
How APIs are Used for Data Collection
Using an API for data collection is a standardized and efficient way to access structured data.
Instead of "web scraping," which involves extracting data directly from a website's HTML,
APIs provide a clean and reliable data stream. This is often the preferred method because it is
less prone to breaking and is a more respectful way to access a company's data.
The general process of using an API for data collection involves these key steps:
1. *Find the Right API:* Identify a public or partner API that offers the data you need. Many
organizations, from social media platforms to government agencies and weather bureaus,
provide APIs to access their data.
2. *Read the Documentation:* API documentation is a manual that explains how to use the
API. It specifies the "endpoints" (URLs for accessing specific resources), the required request
methods (like GET for retrieving data), and the parameters you can use to filter or customize
your request.
3. *Authentication:* Most APIs require an API key or other authentication methods (like
OAuth 2.0) to identify the user and control access. You need to obtain and securely store this
key.
4. *Make the Request:* Using a programming language (like Python or R) or an API client
tool (like Postman), you send a request to the API's endpoint, including your API key and any
necessary parameters.
5. *Process the Response:* The API will send a response, usually in a structured format like
JSON or XML. You then need to parse this data to extract the information you want.
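Putting these steps together, the following Python sketch uses the widely available requests
library to collect data from a hypothetical REST endpoint. The URL, the API key placeholder,
and the city parameter are assumptions for illustration; a real API's documentation defines the
actual endpoint, parameters, and authentication scheme.

```python
import requests

# Hypothetical endpoint and API key -- replace with values from the
# documentation of the API you are actually using.
API_URL = "https://api.example.com/v1/weather"
API_KEY = "YOUR_API_KEY"

def fetch_weather(city):
    # Step 4: make the request, passing parameters and an authentication header.
    response = requests.get(
        API_URL,
        params={"city": city},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors (4xx/5xx)
    # Step 5: process the response -- most APIs return JSON.
    return response.json()

if __name__ == "__main__":
    data = fetch_weather("Hyderabad")
    print(data)
```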
An API, or *Application Programming Interface*, is a set of rules and protocols that allows
different software applications to communicate and exchange information. APIs act as an
intermediary, enabling two separate systems to talk to each other without needing to understand
the internal workings of the other.
Think of an API as a waiter in a restaurant. You, the client, give your order (a request) to the
waiter (the API). The waiter takes your order to the kitchen (the server), which processes it and
prepares your food (the response). The waiter then brings the food back to you. You don't need
to know how the kitchen works; you just need to know how to interact with the waiter to get
what you want.
Types of APIs
APIs are categorized in several ways, most commonly by their architectural style or by their
scope and purpose.
* *REST (Representational State Transfer)*: The most popular and widely used architectural
style for web APIs. RESTful APIs are stateless, meaning each request from a client to a server
contains all the information needed to understand the request, and the server doesn't store any
client context between requests. They use standard HTTP methods like GET, POST, PUT, and
DELETE to perform operations on resources.
* *Pros*: Simple, lightweight, and scalable.
* *Cons*: Can lead to "over-fetching" (getting more data than you need) or "under-fetching"
(needing multiple requests to get all the data).
* *Example*: The Twitter API, which lets you retrieve user tweets or post new ones using
simple URLs and HTTP methods.
* *SOAP (Simple Object Access Protocol)*: A protocol-based API that is more rigid and has
stricter rules than REST. SOAP APIs use XML for their message format and often rely on a
protocol called WSDL (Web Services Description Language) to describe the API's functions.
* *Pros*: Has built-in security and reliability features, making it ideal for enterprise-level
applications.
* *Cons*: Heavier, more complex, and requires more bandwidth due to the verbose XML
format.
* *Example*: Used in legacy systems, financial institutions, and telecommunications for
highly secure and structured transactions.
* *GraphQL*: A modern query language for APIs that was developed to solve the data-fetching
issues of REST. With GraphQL, the client specifies exactly what data it needs in a single
request, eliminating over-fetching and under-fetching.
* *Pros*: Efficient and flexible, allowing clients to get only the data they need with a single
API call.
* *Cons*: Can be more complex to set up and manage than REST, and it trades off some of
the native caching benefits of HTTP.
* *Example*: Used by Facebook, which developed it, and other modern applications that
require precise and efficient data retrieval, particularly in mobile apps.
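To make the GraphQL idea concrete, the sketch below sends a single POST request whose query
names exactly the fields the client wants. The endpoint URL and the field names (user, name,
posts, title) are hypothetical; a real GraphQL API publishes its own schema.

```python
import requests

# Hypothetical GraphQL endpoint and schema; a real API defines its own.
GRAPHQL_URL = "https://api.example.com/graphql"

# The client asks for exactly the fields it needs -- no over- or under-fetching.
query = """
query {
  user(id: "42") {
    name
    posts(limit: 3) {
      title
    }
  }
}
"""

response = requests.post(GRAPHQL_URL, json={"query": query}, timeout=10)
response.raise_for_status()
print(response.json())
```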
* *RPC (Remote Procedure Call)*: One of the oldest API types, which allows a client to
execute a function or procedure on a remote server as if it were a local function.
* *Pros*: Simple and straightforward for performing specific actions.
* *Cons*: Can be tightly coupled and less flexible than other styles.
* *Example*: Can be used in microservices architectures where one service calls a function
in another service.
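One common concrete form of RPC over HTTP is JSON-RPC, in which the client names a remote
procedure and its parameters in a small JSON payload. In the sketch below, the server URL and
the add method are assumed for illustration; the point is that the call reads like invoking a
local function.

```python
import requests

# Hypothetical JSON-RPC endpoint; the "add" method is assumed to exist there.
RPC_URL = "https://rpc.example.com/api"

payload = {
    "jsonrpc": "2.0",   # protocol version
    "method": "add",    # remote procedure to execute
    "params": [2, 3],   # arguments passed to the procedure
    "id": 1,            # request id used to match the response
}

response = requests.post(RPC_URL, json=payload, timeout=10)
response.raise_for_status()
print(response.json())  # e.g. {"jsonrpc": "2.0", "result": 5, "id": 1}
```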
By Scope and Audience
This classification categorizes APIs based on who is allowed to use them.
* *Public (Open) APIs*: These APIs are publicly available and can be used by anyone. They
are often used to enable third-party developers to build applications on top of a company's
services.
* *Example*: The Google Maps API, which lets developers embed maps and location data
into their websites and apps.
* *Partner APIs*: Shared externally but only with specific business partners. Access is often
restricted and requires a special key or token.
* *Example*: An e-commerce site might provide a partner API to a shipping company to
automatically share order and tracking information.
* *Private (Internal) APIs*: Used exclusively within a single organization to connect internal
systems and services. They are not exposed to external developers.
* *Example*: An internal API might be used by a company's sales application to retrieve
customer data from its internal database.
* *Composite APIs*: These APIs combine multiple API calls into a single, streamlined request.
They are useful for complex tasks that would otherwise require multiple separate calls.
* *Example*: A composite API could get user profile information, recent posts, and
comments in one single request, which would have taken three separate calls with a standard
REST API.
EXPLORING AND FIXING DATA
When exploring and fixing data, you're essentially performing a two-part process: data
exploration and data cleaning (or data wrangling). This is a critical step in any data analysis,
machine learning, or business intelligence project to ensure the data is accurate, reliable, and
ready for use.
Data exploration is a crucial first step in the data science process. It involves using statistical
analysis and visualizations to understand the characteristics of a dataset, identify patterns, and
uncover insights. The main goal is to gain a deep understanding of the data before building
predictive models or performing other advanced analyses.
What is Data Exploration?
Data exploration, also known as Exploratory Data Analysis (EDA), is the initial process of
investigating a dataset. It is not about finding the final answers, but rather about asking
questions of the data. This phase is detective work; it helps data scientists understand the
dataset's structure, identify potential problems like missing values or outliers, and discover
relationships between variables. EDA helps inform subsequent steps in the data science
pipeline, such as feature engineering and model selection.
Key Techniques
Data exploration uses a variety of techniques, which can be broadly categorized into two types:
Statistical Techniques
Descriptive Statistics: This involves calculating measures like mean, median, mode,
standard deviation, and variance for numerical data. These statistics summarize the
central tendency and dispersion of the data. For categorical data, you might look at
frequency counts.
Correlation Analysis: This technique helps to understand the relationship between two
or more variables. A correlation coefficient (e.g., Pearson's r) measures the strength and
direction of a linear relationship.
Hypothesis Testing: Although more formal, simple hypothesis tests can be used to
compare groups or check for significant differences.
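A minimal pandas sketch of the descriptive statistics and correlation analysis described above,
using a small made-up dataset (the column names and values are purely illustrative).

```python
import pandas as pd

# Small illustrative dataset -- the columns and values are made up.
df = pd.DataFrame({
    "age":    [23, 31, 45, 29, 38, 52, 41],
    "income": [32000, 45000, 61000, 40000, 55000, 72000, 58000],
    "city":   ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"],
})

# Descriptive statistics: central tendency and dispersion of numeric columns.
print(df[["age", "income"]].describe())

# Frequency counts for a categorical column.
print(df["city"].value_counts())

# Correlation analysis: Pearson's r between the numeric variables.
print(df[["age", "income"]].corr(method="pearson"))
```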
Visualization Techniques
Visualizations are a powerful way to explore data, as they can reveal patterns and relationships
that are hard to see from raw numbers. Some of the most common visualizations include:
Histograms: Used to visualize the distribution of a single numerical variable. They
show how often values fall into different ranges.
Box Plots: Excellent for summarizing the distribution of a numerical variable and
identifying outliers. They show the median, quartiles, and range of the data.
Scatter Plots: Used to visualize the relationship between two numerical variables. Each
point represents a data instance, and the pattern of the points can reveal a correlation.
Bar Charts: Ideal for comparing categorical data. They show the frequency or value
for different categories.
Heatmaps: Useful for visualizing correlations between many variables at once. A grid
of colors represents the strength of the relationships.
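The matplotlib sketch below produces a few of these plots for a toy dataset; the column names
and values are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy dataset for illustration only.
df = pd.DataFrame({
    "age":    [23, 31, 45, 29, 38, 52, 41, 36, 27, 49],
    "income": [32, 45, 61, 40, 55, 72, 58, 50, 35, 66],  # in thousands
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: distribution of a single numerical variable.
axes[0].hist(df["age"], bins=5)
axes[0].set_title("Histogram of age")

# Box plot: median, quartiles, and potential outliers.
axes[1].boxplot(df["income"])
axes[1].set_title("Box plot of income")

# Scatter plot: relationship between two numerical variables.
axes[2].scatter(df["age"], df["income"])
axes[2].set_title("Age vs income")

plt.tight_layout()
plt.show()
```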
Why is it Important?
Data exploration is critical for several reasons:
Data Cleaning: It helps identify and address data quality issues, such as missing values,
incorrect data types, or duplicates.
Feature Understanding: It provides insights into the characteristics of each variable,
which can guide the process of feature selection and creation.
Outlier Detection: EDA helps spot unusual data points or outliers that could skew the
results of a model.
Hypothesis Generation: By exploring the data, data scientists can form new hypotheses
about the relationships between variables that can be tested later.
Informing Modeling: The insights gained from EDA can help choose the right machine
learning algorithms and guide model building, leading to more accurate and reliable
results.
Data Cleaning
Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, or
irrelevant records from a dataset. The goal is to improve the quality of the data so that it can be
used for analysis.
Common Problems and Their Solutions:
Missing Values: Missing data can be handled in several ways:
Removal: Delete rows or columns with a high percentage of missing values. This is suitable
for small datasets or when the number of missing values is minimal.
Imputation: Fill in missing values using a replacement strategy. Common methods include
using the mean, median, or mode for numerical data, or a most frequent value for categorical
data.
Inconsistent Data: This includes variations in data entry, such as "USA," "U.S.A.," and
"United States."
Solution: Standardize the data by creating a consistent format. Use techniques like string
manipulation to convert all variations to a single, unified value (e.g., "USA").
Duplicate Records: When the same data point appears multiple times.
Solution: Remove the duplicate rows, keeping only one instance. This is crucial for accurate
analysis.
Incorrect Data Types: For example, a column representing numerical data is stored as a string.
Solution: Convert the data type to the correct format (e.g., converting a string to an integer or
float).
Outliers: Data points that are significantly different from other observations. They can be valid
but may also be a result of data entry errors.
Solution: Depending on the context, you might remove them or transform them. For example,
you can cap the values at a certain percentile to reduce their impact.
Structural Errors: Incorrectly formatted data, such as a single column containing multiple
data points that should be in separate columns.
Solution: Reshape the data using techniques like splitting columns or pivoting tables to create
a logical and structured format.
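A short pandas sketch that applies several of these fixes to a made-up dataset. The column
names, values, and chosen strategies (median imputation, percentile capping) are assumptions
for illustration; real cleaning decisions depend on the data and the analysis goal.

```python
import pandas as pd

# Made-up raw data with the typical problems described above.
raw = pd.DataFrame({
    "country": ["USA", "U.S.A.", "United States", "India", "India"],
    "sales":   ["100", "250", None, "175", "175"],
})

# Inconsistent data: map the variants to one standard value.
raw["country"] = raw["country"].replace({"U.S.A.": "USA", "United States": "USA"})

# Incorrect data types: convert the sales column from string to numeric.
raw["sales"] = pd.to_numeric(raw["sales"], errors="coerce")

# Missing values: impute with the median of the column.
raw["sales"] = raw["sales"].fillna(raw["sales"].median())

# Duplicate records: keep only the first occurrence.
clean = raw.drop_duplicates().copy()

# Outliers: cap values at the 5th and 95th percentiles.
low, high = clean["sales"].quantile([0.05, 0.95])
clean["sales"] = clean["sales"].clip(lower=low, upper=high)

print(clean)
```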
DATA STORAGE AND MANAGEMENT, USING MULTIPLE DATA SOURCES
Data storage and management using multiple sources is a complex but crucial process that
involves integrating disparate datasets into a unified, accessible, and high-quality system. It is
a fundamental step for businesses aiming to make data-driven decisions and derive meaningful
insights.
Key Challenges
When working with multiple data sources, organizations face several challenges:
Data Silos: Data is often isolated in different departments or systems (e.g., CRM, ERP,
marketing platforms), making it difficult to get a complete view.
Data Variety: Data comes in many forms, including structured data from relational
databases, semi-structured data like JSON or XML, and unstructured data such as text
documents and images.
Data Inconsistency: Disparate sources may have different formats, naming
conventions, or data types, leading to inconsistencies that must be resolved.
Data Quality: Issues like missing values, duplicates, and inaccurate information are
common and can compromise the reliability of analysis.
The Integration Process
To overcome these challenges, a structured approach is essential. The most common method
for data integration is a process known as Extract, Transform, Load (ETL) or its modern
variant, Extract, Load, Transform (ELT).
1. Extract: This is the first step, where data is pulled from all identified sources. These
sources can be anything from relational databases and cloud applications to flat files
and APIs.
2. Transform: Data from different sources is cleaned, standardized, and aggregated to
ensure it's consistent and ready for analysis. This step includes:
o Data Cleansing: Removing duplicates, correcting errors, and handling missing
values.
o Standardization: Ensuring consistent data formats, such as dates or currency.
o Aggregation: Summarizing detailed records at a coarser level of granularity (for
example, rolling daily transactions up into monthly totals).
3. Load: The transformed data is loaded into a central repository, often a data warehouse
or a data lake.
While ETL is a traditional approach, ELT has become popular with the rise of cloud-based
data warehouses. In ELT, raw data is loaded directly into the data warehouse first, and the
transformation happens within the warehouse itself. This approach is beneficial because
modern data warehouses are powerful enough to handle large-scale transformations, and it
allows for greater flexibility.
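A compact sketch of the ETL flow described above: data is extracted from two hypothetical
sources (a CSV file and an API endpoint), transformed with pandas, and loaded into a local
SQLite table standing in for a warehouse. The file name, URL, and column names are assumptions.

```python
import sqlite3
import pandas as pd
import requests

# Extract: pull data from two hypothetical sources.
sales = pd.read_csv("sales.csv")  # assumed columns: order_id, amount, date
api_rows = requests.get("https://api.example.com/v1/orders", timeout=10).json()
orders = pd.DataFrame(api_rows)   # assumed to share the same columns

# Transform: clean, standardize, and aggregate.
combined = pd.concat([sales, orders], ignore_index=True)
combined = combined.drop_duplicates(subset="order_id")            # data cleansing
combined["date"] = pd.to_datetime(combined["date"])               # standardization
daily = (combined.groupby(combined["date"].dt.date)["amount"]
         .sum()
         .reset_index())                                           # aggregation

# Load: write the transformed data into a central repository (here, SQLite).
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)
```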
Storage and Management Solutions
After integration, the data needs to be stored and managed in a way that supports the
organization's goals. The choice of storage solution depends on the type of data and the
intended use case.
Data Warehouse: This is a centralized repository that stores structured, historical data
from multiple sources. It's optimized for fast querying and analysis, making it ideal for
business intelligence and reporting.
Data Lake: A data lake is a vast storage repository that can hold large amounts of raw
data in its native format. It's more flexible than a data warehouse and is well-suited for
advanced analytics, machine learning, and data exploration.
Hybrid Solutions: Many companies use a combination of on-premise and cloud-based
solutions to create a flexible and scalable data management infrastructure.