ML Assignment 2: Exploring Data Types and Data Collection Methods
1. Understanding Data Types
Task 1: Define and Describe Data Types
1. Structured Data
o Definition: Data organized into rows and columns, typically stored in relational
databases.
o Example: An Excel spreadsheet containing employee records.
o Characteristics: Highly organized, easily searchable using SQL, suitable for
traditional data analysis tools.
2. Semi-structured Data
o Definition: Data that does not conform to a fixed relational schema but carries
some organizational markers (tags, keys).
o Example: JSON or XML files.
o Characteristics: Flexible structure, allows for hierarchical relationships, needs
special parsing tools.
3. Unstructured Data
o Definition: Data that lacks a predefined format or structure.
o Example: Videos, images, emails, social media posts.
o Characteristics: Requires preprocessing or AI techniques to analyze; storage and
management are more complex.
4. Quantitative Data
o Definition: Numeric data that represents measurable quantities.
o Example: Height, temperature, income.
o Characteristics: Supports statistical and mathematical analysis.
5. Qualitative Data
o Definition: Descriptive data that represents categories or qualities.
o Example: Customer feedback, product reviews.
o Characteristics: Analyzed using thematic or content analysis, not easily
quantifiable.
6. Primary Data
o Definition: Data collected directly by the researcher for a specific purpose.
o Example: Responses from a custom survey.
o Characteristics: Original, tailored to specific research needs, and usually more
relevant and current; accuracy depends on how well the collection is designed.
7. Secondary Data
o Definition: Data collected by others, used for a purpose different from its original
intent.
o Example: Government census data.
o Characteristics: Readily available, less costly, but may not fit research needs
exactly.
Task 2: Implications for Data Analysis
• Structured Data: Easily analyzed using SQL and statistical software. Visualization tools
like bar charts, line graphs, and dashboards work well.
• Semi-structured Data: Requires parsing and transformation before analysis. Techniques
include JSON/XML parsers, followed by statistical or machine learning tools.
• Unstructured Data: Needs preprocessing (e.g., NLP for text, computer vision for
images). Advanced techniques are essential for extracting useful insights.
• Quantitative Data: Well suited to statistical methods (e.g., regression, correlation
analysis). Easily visualized with histograms, scatter plots, and line charts.
• Qualitative Data: Analyzed using coding and thematic analysis. Visualizations include
word clouds, concept maps.
• Primary vs. Secondary Data: Primary data is usually more relevant to the research
question but costlier and slower to collect. Secondary data is faster and cheaper to
obtain but may lack specificity.
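To illustrate the "parse and transform" step for semi-structured data, here is a minimal sketch using only the Python standard library. The records, field names, and the flatten helper are hypothetical, made up for this example; real data would need its own mapping.

```python
import json
from statistics import mean

# Hypothetical semi-structured input: nested fields, and one record has no "skills".
raw = """
[
  {"id": 1, "name": "Ana", "job": {"title": "Engineer", "salary": 72000}, "skills": ["SQL", "Python"]},
  {"id": 2, "name": "Ben", "job": {"title": "Analyst", "salary": 58000}},
  {"id": 3, "name": "Chi", "job": {"title": "Engineer", "salary": 80000}, "skills": ["Go"]}
]
"""

def flatten(record):
    """Turn one nested record into a flat row with a fixed set of columns."""
    job = record.get("job", {})
    return {
        "id": record["id"],
        "name": record["name"],
        "title": job.get("title"),
        "salary": job.get("salary"),
        "n_skills": len(record.get("skills", [])),  # missing list counts as 0
    }

# Parsing step: JSON text -> Python objects -> flat, table-like rows.
rows = [flatten(r) for r in json.loads(raw)]

# Once the data is tabular, ordinary descriptive statistics apply.
engineer_salaries = [r["salary"] for r in rows if r["title"] == "Engineer"]
avg_engineer_salary = mean(engineer_salaries)

print(rows)
print(avg_engineer_salary)
```

After flattening, each row has the same columns, so the data can be handled like structured data (for example, loaded into a spreadsheet or SQL table).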
Task 3: Data Type Table
Data Type       | Example               | Analysis Method
Structured      | SQL Database          | Descriptive statistics, SQL queries
Semi-structured | JSON/XML              | Parsing, keyword extraction
Unstructured    | Video/Text/Image      | NLP, image recognition, deep learning
Quantitative    | Test Scores, Age      | Statistical modeling, regression
Qualitative     | Interview Transcripts | Thematic/content analysis
Primary         | User-conducted survey | Tailored analysis, high relevance
Secondary       | Public health reports | Comparative/trend analysis
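As a rough illustration of content analysis on qualitative data, the sketch below assigns themes to short feedback texts by keyword matching and counts how often each theme appears. The codebook, the keywords, and the feedback sentences are all hypothetical; real thematic coding is normally done or reviewed by a human analyst.

```python
import re
from collections import Counter

# Hypothetical codebook: theme -> keywords that signal the theme.
CODEBOOK = {
    "price": {"price", "expensive", "cheap", "cost"},
    "quality": {"quality", "broke", "sturdy", "durable"},
    "service": {"support", "service", "staff", "helpful"},
}

# Hypothetical open-ended customer feedback.
feedback = [
    "Great quality, but the price is too high.",
    "Support staff were helpful and quick.",
    "It broke after a week. Too expensive for that.",
]

def code_response(text):
    """Return the set of themes whose keywords appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {theme for theme, keywords in CODEBOOK.items() if words & keywords}

# Count each theme at most once per response.
theme_counts = Counter(theme for text in feedback for theme in code_response(text))
print(theme_counts)
```

Counting themes this way turns qualitative text into simple frequencies, which can then be charted or compared across groups.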
2. Data Collection Methods
Task 1: Describe Data Collection Methods
1. Surveys
o Description: Structured questionnaires used to collect responses from a
population.
o Purpose: Collect standardized data on opinions, behaviors, demographics.
o Use Cases: Market research, academic studies.
2. Experiments
o Description: Controlled tests where variables are manipulated to observe
outcomes.
o Purpose: Establish cause-effect relationships.
o Use Cases: Clinical trials, A/B testing in product development.
3. Observational Studies
o Description: Researchers observe subjects in natural settings without
interference.
o Purpose: Study behaviors and interactions in real-world environments.
o Use Cases: Ethnographic research, user experience studies.
Task 2: Data Type Suitability
• Surveys
o Suitable Data: Structured (Likert scale), Quantitative (age), Qualitative
(open-ended).
o Challenges: Risk of response bias, low response rates.
• Experiments
o Suitable Data: Primarily quantitative.
o Challenges: Costly, may raise ethical concerns, limited to specific settings.
• Observational Studies
o Suitable Data: Qualitative, unstructured (video, audio).
o Challenges: Observer bias, limited control over variables, potential privacy
issues.
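To make the survey case concrete, here is a small sketch that encodes Likert-scale answers as numbers and reports the response rate, one of the challenges listed above. The five-point label set and the responses are hypothetical. The median is used because Likert scores are ordinal, so a mean can be misleading.

```python
from collections import Counter
from statistics import median

# Hypothetical 5-point Likert scale mapped to ordinal scores.
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

# Hypothetical answers to one survey item; None marks a skipped question.
responses = ["Agree", "Strongly agree", None, "Neutral",
             "Agree", None, "Disagree", "Agree"]

# Keep only answered items and convert labels to scores.
scores = [LIKERT[r] for r in responses if r is not None]

response_rate = len(scores) / len(responses)  # share of people who answered
median_score = median(scores)                 # central tendency for ordinal data
distribution = Counter(scores)                # how many answers per score

print(response_rate, median_score, distribution)
```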
Task 3: Impact on Data Quality
• Surveys: Can provide large-scale data quickly, but quality depends on question clarity
and respondent honesty.
• Experiments: High internal validity due to controlled variables, but results may not
reflect real-world behavior.
• Observational Studies: High ecological validity, but subject to interpretation and harder
to replicate.
Examples:
• Poorly designed surveys can yield unreliable results (e.g., ambiguous questions).
• Experiments with small samples may lack statistical power.
• Observer presence in observational studies may alter subject behavior (Hawthorne
effect).
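The point about small samples and statistical power can be sketched numerically. The function below is an approximation, not an exact calculation: it assumes a two-sided, two-sample comparison of means with equal group sizes, uses a normal approximation, and takes a standardized effect size (difference in means divided by the standard deviation). The function name and the example values are chosen for illustration only.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (normal approximation)."""
    z = NormalDist()                            # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)           # critical value, about 1.96 for alpha=0.05
    shift = effect_size * sqrt(n_per_group / 2) # expected size of the test statistic
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Same medium-sized effect, two different sample sizes (hypothetical values).
power_small = approx_power(0.5, 10)
power_large = approx_power(0.5, 100)
print(power_small, power_large)
```

Under these assumptions, 10 subjects per group gives a power of about 0.2, while 100 per group gives a power above 0.9, so the small experiment would miss a real effect most of the time.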