Lecture Notes 2
Data Collection
Learn to effectively collect reliable and high-quality data for
informed analysis and decision-making
Agenda
1. What Types Of Data Can We Gather For Data Analysis?
2. An Overview Of Data Collection Methods
3. Definition Of Sampling And Sampling Techniques
4. Real-world Challenges In The Data Collection Pipeline
Data analysis cannot be performed without data. Data is the fundamental input for analysis,
as it provides the raw information that analysts use to extract insights, identify trends, and
make informed decisions.
1. Primary Data: Primary data refers to original data collected firsthand by the researcher
or organization for a specific purpose. It is often gathered through direct methods like
surveys, interviews, experiments, and observations.
Examples:
A company conducting customer surveys to gather feedback on a new product.
Researchers conducting experiments to collect data on the effects of a drug.
2. Secondary Data: Secondary data refers to data that has already been collected and
published by someone else, often for a different purpose than the current research. It
includes datasets from government reports, academic papers, market research reports,
and historical records.
Examples:
Publicly available government statistics like census data.
Academic studies using existing datasets.
3. Third-Party Data: Third-party data refers to data that is collected by an external
organization and then made available for use by others. This data is often sold or
provided through platforms, APIs, or subscription-based services.
Examples:
Social media analytics data from platforms like Twitter or Facebook.
Market research reports provided by consulting firms.
Financial data purchased from data vendors.
The table below compares the three types of data by their source:
Criteria: Source
Primary Data: Direct source (e.g., surveys, experiments, observations)
Secondary Data: Existing datasets from public or private sources (e.g., government reports, academic papers)
Third-Party Data: External organizations, data providers, and platforms (e.g., social media analytics, market reports)
1. Surveys: Surveys are one of the most common methods of primary data collection,
often used to gather quantitative data from a large number of respondents. They can be
conducted through various means, such as online questionnaires, face-to-face
interviews, or phone surveys.
2. Experiments: Experiments involve manipulating one or more variables and observing the
effects. This method is typically used in scientific research or controlled studies to
establish cause-and-effect relationships between variables.
3. Observations: This method involves directly observing people, processes, or events in
real-time without intervening. It is commonly used in behavioral studies, ethnography,
and other qualitative research.
4. Databases: Using existing databases is a secondary data collection method. This
involves accessing datasets that have already been collected, often by government
agencies, research institutions, or private companies. Examples include census data,
financial reports, or academic research datasets.
These four methods cover a wide range of data collection techniques and can be combined
depending on the research goals and resources available.
Surveys are a widely used method for collecting data, particularly when you need to gather
information from a large number of respondents. They are flexible and can be adapted to a
variety of contexts, from market research to social science studies. Below, we explore
different tools and techniques to implement surveys effectively.
Google Forms: Easy to use, integrates with Google Sheets for data analysis, and customizable. Best for quick surveys and educational purposes; free to use.
To ensure that surveys yield valuable insights, consider these best practices:
Pilot Testing: Before launching a survey, test it on a small sample to identify potential
issues in question clarity or functionality.
Clear Instructions: Provide clear guidelines on how to complete the survey and an
estimated time to finish.
Incentives: Offer incentives like discounts or prize draws to encourage higher response
rates.
Follow-Up: Send reminders to non-respondents to increase completion rates.
Data Privacy: Ensure respondents are informed about how their data will be used and
stored, adhering to privacy laws and regulations.
Experiments can generally be categorized into controlled and field experiments. Each type
has its strengths, limitations, and appropriate use cases. Here’s a detailed breakdown of
both:
Control Over Variables: Controlled experiments offer high control over independent and extraneous variables; field experiments offer low control over extraneous variables.
Validity: Controlled experiments have high internal validity (strong cause-and-effect); field experiments have high external validity (results apply to the real world).
Informed Consent: Participants should be fully informed about the purpose of the
experiment, any potential risks, and their right to withdraw at any time.
Confidentiality: Data collected from participants should be kept confidential and stored
securely.
Debriefing: After the experiment, participants should be debriefed about the true
purpose of the study, especially in cases where deception is used.
Avoiding Harm: Experiments should avoid causing any physical or psychological harm to
participants.
Labster: Labster offers virtual lab simulations that allow users to conduct interactive
experiments online. It is designed to make science experiments accessible remotely,
providing a simulated environment for students and researchers.
Best For: This tool is ideal for educational experiments, particularly in the
sciences, and is widely used in remote learning setups. It's perfect for teaching
concepts that require practical experience but cannot be conducted in a physical
lab.
Qualtrics: Qualtrics is a powerful survey and experiment platform that supports advanced
features such as A/B testing and data analysis. It allows users to design experiments, collect
data, and analyze results in one integrated system.
Best For: It's best suited for online experiments, especially those focused on
marketing, consumer behavior, and user experience studies. Researchers can
easily set up controlled experiments and analyze the results using built-in
analytics.
SPSS (IBM): SPSS is a comprehensive data analysis software that supports experimental
designs, statistical analyses, and hypothesis testing. It offers tools for designing
experiments, managing data, and conducting a range of statistical tests.
Best For: SPSS is highly effective for analyzing data from controlled experiments,
particularly in social sciences, psychology, and other fields where hypothesis
testing and statistical analysis are required.
Google Optimize: Google Optimize is a tool for running A/B tests and optimizing user
experience (UX) on websites. It enables users to test different website variations and analyze
the results based on user interactions.
Best For: This tool is specifically tailored for marketing experiments, website
optimization, and UX testing. It is commonly used for experiments aimed at
improving the performance of websites and digital products.
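The statistical idea behind an A/B test can be illustrated with a short, self-contained sketch. The example below runs a two-proportion z-test on made-up conversion counts for two page variants; the numbers are hypothetical and the code is not tied to Qualtrics or Google Optimize, which handle this kind of analysis internally.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical A/B test results: conversions out of visitors for two page variants.
conv_a, n_a = 120, 2400   # variant A: 5.00% conversion
conv_b, n_b = 150, 2400   # variant B: 6.25% conversion

# Pooled two-proportion z-test, the classic analysis behind an A/B test.
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

print(f"variant A: {p_a:.3%}, variant B: {p_b:.3%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value would suggest that the difference in conversion rates is unlikely to be due to chance alone.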
Types of Observation
To facilitate efficient and accurate data collection, various tools and technologies have been
developed. These tools range from traditional methods like manual recording to advanced
digital solutions that automate and enhance the observation process.
Databases are organized systems used to store, manage, and retrieve data efficiently. They
serve as essential tools for collecting and maintaining large volumes of structured data,
ensuring accessibility and reliability for analysis.
Broadly, they are classified into two main categories based on their data structure and
usage: Relational Databases and Non-Relational (NoSQL) Databases. Each type has unique
characteristics and is suited to specific use cases.
1. Relational Databases: Relational databases use tables (rows and columns) to organize
data and define relationships between data points. They rely on structured schemas that
ensure consistency and accuracy.
Examples:
MySQL: Open-source database commonly used for web applications.
PostgreSQL: Advanced open-source database with support for complex queries
and custom functions.
Oracle Database: Enterprise-grade database known for its robustness and
scalability.
Use Case: Relational databases are ideal for structured data with consistent
relationships. For example, in an e-commerce platform, relational databases can
store: Customer information: Name, email, and contact details. Purchase records:
Transaction history tied to customer IDs. These relationships allow seamless tracking
of customer behaviors and transaction patterns.
2. Non-Relational (NoSQL) Databases: NoSQL databases handle data in a flexible and
scalable way, without requiring predefined schemas. They are designed to store
unstructured or semi-structured data, making them highly adaptable for modern data
needs.
Examples:
MongoDB: Document-oriented database storing data in JSON-like formats.
Cassandra: Distributed database designed for scalability and high availability.
Firebase: Real-time database for mobile and web app development.
Use Case: NoSQL databases excel in managing massive and dynamic datasets. For
instance, in a website with user-generated content, they can handle: User posts:
Blogs, comments, or images stored as documents. Clickstream data: Logs capturing
user interactions, such as page views and clicks, in real-time. This flexibility ensures
smooth scaling as the volume and diversity of data grow.
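A minimal sketch can make the contrast concrete. The snippet below uses Python's built-in sqlite3 module to model the e-commerce example as two related tables, and a plain dictionary to stand in for the JSON-like document a store such as MongoDB would hold; the table names, fields, and values are illustrative assumptions, not a specific production schema.

```python
import sqlite3

# --- Relational style: structured tables with an explicit relationship ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE TABLE purchases (id INTEGER PRIMARY KEY, "
             "customer_id INTEGER REFERENCES customers(id), item TEXT, amount REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")
conn.execute("INSERT INTO purchases VALUES (1, 1, 'Keyboard', 49.90)")

# Join purchase records to customer information via the shared customer id.
rows = conn.execute(
    "SELECT c.name, p.item, p.amount "
    "FROM purchases p JOIN customers c ON c.id = p.customer_id"
).fetchall()
print(rows)  # [('Ada', 'Keyboard', 49.9)]

# --- Document (NoSQL) style: one flexible, schema-less record per user post ---
post = {
    "user": "Ada",
    "type": "blog",
    "text": "First impressions of the new keyboard",
    "tags": ["review", "hardware"],
    "clickstream": [{"page": "/post/1", "event": "view"}],  # nested, semi-structured data
}
print(post["tags"])
```

The relational version enforces a fixed schema and an explicit customer-to-purchase relationship, while the document version accepts whatever fields each record happens to have.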
Each data collection method has its strengths, but they also come with specific flaws that
can impact data quality, reliability, and usability. Here's an overview of potential drawbacks
for each technique:
Limitations of Surveys
Limitations of Experiments
Limitations of Observations
Limitations of Databases
In data analysis, the quantity of data plays a critical role in shaping insights and decisions.
Collecting either too little or too much data can have adverse consequences, making it
essential to strike the right balance.
Lack of Representation: A small dataset may fail to capture the diversity or complexity of
the phenomenon being studied.
Example: Analyzing customer preferences with a sample of only 10 customers may
overlook trends and variations in the broader audience.
Statistical Limitations: Insufficient data leads to unreliable statistical conclusions,
increasing the likelihood of errors such as overfitting or underfitting.
Impact: Predictions or insights may lack robustness and generalizability.
Understand the Objective: Focus on collecting only the data needed to answer the
research question or solve the problem.
Tip: Start by defining the goals and identifying the variables critical to achieving them.
Conduct a Pilot Study: A small initial dataset can help identify whether the quantity and
type of data are sufficient for the intended analysis.
Benefit: This approach minimizes unnecessary effort while refining data collection
methods.
Leverage Sampling Techniques: Use appropriate sampling methods to work with
representative subsets instead of entire datasets.
Example: Stratified sampling ensures all key subgroups are included without
excessive data collection.
Regularly Evaluate Data Needs: Periodically reassess the volume of data required as
the project progresses to ensure alignment with goals.
The Goldilocks principle: Just as Goldilocks sought porridge that was "just right," data analysts should aim for datasets that are neither too small to be meaningful nor too large to be practical.
A sample is a subset of data selected from a larger group, known as the population, for the
purpose of analysis. Sampling is used to make inferences about the population without
examining every individual or data point. For example: Surveying 1,000 citizens to understand
the opinions of a country's entire population.
Key Insight: A well-chosen sample allows for accurate, reliable conclusions while saving time and resources.
Population: The entire group of interest in a study or analysis.
Sample: A subset of the population selected for analysis.
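To see how a sample supports inference about a population, the hedged sketch below simulates a made-up population of 50,000 opinion scores and compares the population mean with the mean of a random sample of 1,000, loosely mirroring the citizen-survey example; all numbers are synthetic and for illustration only.

```python
import random

random.seed(42)

# Simulated population: 50,000 citizens, each with a continuous opinion score (mean ~6.2).
population = [random.gauss(6.2, 1.8) for _ in range(50_000)]

# Survey only 1,000 randomly chosen citizens (the sample).
sample = random.sample(population, k=1_000)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean: {pop_mean:.3f}")
print(f"sample mean    : {sample_mean:.3f}  (estimated from 2% of the population)")
```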
What is Sampling?
Feasibility: Collecting data from an entire population can be difficult due to the size, cost,
and logistical challenges. For instance, reaching a large or global population may be time-
consuming and expensive, requiring significant resources. Sampling allows researchers to
gather insights from a smaller, more manageable group, making data collection more
feasible.
Efficiency: Sampling offers greater efficiency by reducing the time, effort, and costs needed
for data collection. A smaller sample means quicker data collection and analysis, as well as
fewer resources needed for outreach, processing, and storage. This makes sampling an
attractive option when resources are limited.
Types of Sampling
There are several techniques for sampling, each suited to different research needs. The
most common sampling methods include simple random sampling, stratified sampling, and
cluster sampling. These techniques help ensure that the sample is representative, which is
essential for accurate analysis and decision-making.
Simple Random Sampling
Scenario: A university wants to survey 100 students about their satisfaction with
campus facilities. The university has 5,000 students enrolled. To ensure that each
student has an equal chance of being selected, the survey team uses simple random
sampling. They randomly select 100 students from the entire student body using a
random number generator.
Outcome: Each student in the university, regardless of their program or year, has an
equal chance of being included in the survey. This method is ideal when there is no
need to categorize students based on specific characteristics, and the goal is simply
to get an unbiased representation of the population.
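The university scenario maps directly onto a few lines of code. The sketch below assumes student IDs 1 to 5,000 and uses Python's random.sample as the random number generator to draw 100 of them without replacement.

```python
import random

random.seed(7)  # for a reproducible draw

student_ids = list(range(1, 5001))            # all 5,000 enrolled students
surveyed = random.sample(student_ids, k=100)  # each student has an equal chance of selection

print(len(surveyed), surveyed[:10])
```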
Stratified Sampling
The population is divided into distinct subgroups, or strata, based on specific characteristics
(e.g., age, gender, income level). Then, a random sample is taken from each subgroup. This
ensures that every subgroup is properly represented in the final sample.
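A small sketch of proportional stratified sampling follows, assuming a hypothetical record list in which each person carries a "stratum" label (for example, year of study); the number drawn from each stratum is proportional to that stratum's share of the population.

```python
import random
from collections import defaultdict

random.seed(7)

# Hypothetical population: each record has a stratum label (e.g., year of study).
population = [{"id": i, "stratum": random.choice(["Y1", "Y2", "Y3", "Y4"])} for i in range(5000)]
total_sample_size = 100

# Group records by stratum, then draw a proportional random sample from each group.
groups = defaultdict(list)
for person in population:
    groups[person["stratum"]].append(person)

sample = []
for stratum, members in groups.items():
    k = round(total_sample_size * len(members) / len(population))  # proportional allocation
    sample.extend(random.sample(members, k))

# Count how many sampled records came from each stratum (rounding may shift the total slightly).
print({s: sum(p["stratum"] == s for p in sample) for s in groups})
```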
Cluster Sampling
In cluster sampling, the population is divided into clusters (often geographically), and a
random sample of clusters is selected. Then, all individuals within the selected clusters are
surveyed. This method is useful when the population is spread out geographically.
Scenario: A national educational organization wants to assess the effectiveness of a
new online learning platform. Since the platform is used by schools across the
country, it would be costly and time-consuming to survey every school. Instead, they
use cluster sampling. They divide the country into regions, then randomly select 10
regions. Afterward, they survey all the schools in these selected regions that use the
platform.
Outcome: By selecting clusters (regions) instead of individual schools, the
organization significantly reduces costs and logistical challenges. While this method
may lead to some bias if the chosen regions are not representative of the entire
country, it is still a cost-effective way to obtain a large sample when population
members are geographically dispersed.
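The regional-schools scenario can be sketched as follows, under the assumption that schools are stored as (region, school) pairs; a random set of regions (clusters) is drawn first, and every school inside those regions is then surveyed.

```python
import random

random.seed(7)

# Hypothetical sampling frame: each school is recorded with the region it belongs to.
regions = [f"region_{i}" for i in range(1, 51)]                             # 50 regions
schools = [(r, f"{r}_school_{j}") for r in regions for j in range(1, 21)]   # 20 schools per region

# Step 1: randomly select 10 clusters (regions).
chosen_regions = set(random.sample(regions, k=10))

# Step 2: survey every school within the chosen regions.
surveyed_schools = [name for region, name in schools if region in chosen_regions]

print(len(surveyed_schools), "schools surveyed across", len(chosen_regions), "regions")
```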
Note: Other sampling techniques include systematic sampling, where every k-th individual is selected from a population after a random starting point, and convenience sampling, where individuals are chosen based on ease of access.
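Systematic sampling is simple enough to show in a couple of lines; the sketch below assumes a population list of 5,000 entries and a desired sample of 100, giving an interval of k = 50 with a random starting point inside the first interval.

```python
import random

random.seed(7)

population = list(range(1, 5001))     # 5,000 population members
sample_size = 100
k = len(population) // sample_size    # selection interval: every 50th member

start = random.randrange(k)           # random starting point within the first interval
systematic_sample = population[start::k][:sample_size]

print(len(systematic_sample), systematic_sample[:5])
```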
What is Sampling Bias?
Sampling bias occurs when the sample collected for analysis does
not accurately represent the population from which it was drawn. This leads to skewed or
inaccurate results, which can significantly impact the validity of any analysis or conclusions
drawn from the data.
Sources of Sampling Bias
Selection Bias: This occurs when certain individuals or groups are more likely to be
selected than others, often due to non-random selection methods. For example, only
selecting participants from a particular region or group could exclude others, making the
sample unrepresentative.
Nonresponse Bias: This happens when a significant portion of the selected sample does
not respond or participate. For instance, if only a small subset of survey respondents
answer a poll, those responses may not reflect the views of the larger population.
Response Bias: Response bias happens when participants provide inaccurate or biased
answers, either intentionally or unintentionally. This could result from the way questions
are worded, the survey environment, or social pressures. For instance, people may
exaggerate or provide socially desirable responses, leading to a biased dataset.
Measurement Bias: Measurement bias occurs when the tools or methods used to
collect data consistently produce inaccurate results. This can happen due to faulty
instruments, poor survey design, or misinterpretation of data. For example, using a faulty
scale in a study could result in incorrect weight measurements that misrepresent the
population.
Reporting Bias: Reporting bias is when only certain data or results are reported, usually
due to selective memory or a desire to highlight specific outcomes. This can occur if
researchers only report successful outcomes or ignore data that doesn't fit the
hypothesis, leading to a skewed interpretation of the results.
Data Quality Issues: One of the primary challenges in data collection is ensuring the quality
of the data. Data quality can be compromised in several ways, such as errors during data
entry, inconsistencies in how data is recorded, or gaps in data. Poor quality data can lead to
inaccurate conclusions and flawed analysis, which undermines the entire research or
decision-making process.
Common Causes: Human errors during data entry, lack of standardization in the
collection process, and faulty data collection tools.
Impact: Inaccurate or inconsistent data can lead to misleading insights and decisions,
wasting time and resources.
Solution: Implementing robust data validation techniques, standardized protocols,
and regular audits can help ensure that data collected is accurate and of high quality.
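As a hedged illustration of the kind of validation referred to above, the sketch below checks a few made-up survey records against simple rules (required fields, value ranges, allowed categories) and separates clean rows from rows that need review; the field names and rules are assumptions chosen for the example.

```python
# Hypothetical raw survey records collected from different sources.
records = [
    {"respondent_id": 1, "age": 34, "gender": "F", "score": 8},
    {"respondent_id": 2, "age": -5, "gender": "M", "score": 7},    # invalid age
    {"respondent_id": 3, "age": 29, "gender": "X", "score": 11},   # unknown code, score out of range
    {"respondent_id": 4, "age": 41, "gender": "M"},                # missing score
]

REQUIRED_FIELDS = {"respondent_id", "age", "gender", "score"}
ALLOWED_GENDERS = {"F", "M", "Other"}

def validate(record):
    """Return a list of problems found in a single record (an empty list means it is valid)."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "age" in record and not (0 <= record["age"] <= 120):
        problems.append("age out of range")
    if "gender" in record and record["gender"] not in ALLOWED_GENDERS:
        problems.append("unexpected gender code")
    if "score" in record and not (0 <= record["score"] <= 10):
        problems.append("score outside the 0-10 scale")
    return problems

clean, flagged = [], []
for rec in records:
    (clean if not validate(rec) else flagged).append(rec)

print(f"{len(clean)} clean records, {len(flagged)} flagged for review")
```

Checks like these can run automatically as data arrives, so entry errors are caught before they reach the analysis stage.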
Data Privacy Concerns: In many industries, data privacy is a growing concern, especially
with the increasing amount of personal and sensitive information being collected. Data
privacy violations can have serious legal, ethical, and financial consequences. Data
collection methods must ensure that personal information is protected and that data
collection complies with regulations like GDPR, HIPAA, or other relevant privacy laws.