Data Sources & Course Project
ANUP APREM
• BASED ON MATERIAL FROM DALT7002 (P08801): DATA SCIENCE FOUNDATIONS AT
OXFORD BROOKES UNIVERSITY
• SPONSORED BY BRITISH COUNCIL GOING GLOBAL EXPLORATORY GRANT
Data Sources
● These are the sources of data: where does the data come from?
● These are classified into three categories
– Primary data sources
– Secondary data sources
– Tertiary data sources
Data Sources
● Primary data sources: Original sources (materials, events or evidence)
recorded as they actually happened. That is, the data has not been
interpreted or analysed; it is first-hand, original material.
– Examples include: dissertations, original research, original data, some
government reports, speeches, letters, interviews, etc.
● Secondary data sources: These explain primary sources and provide
analysis of them. In other words, these sources summarize and
analyse data in order to add value to primary data sources.
– Examples are: textbooks, edited works (conference proceedings, etc.),
review research articles, biographies, political analysis, etc.
Data Sources
Tertiary data sources:
● Distillation and collection of primary and secondary sources
● These are sources that index, organize or compile other
data sources. Examples include:
dictionaries, encyclopedias, Wikipedia,
directories, manuals, indexing sources,
guide books, etc.
Evaluating Data Sources
• Large number of data sources (e.g. the Internet)
• Large volume of data can be collected from different sources
• How to assess the quality of data?
CRAAP Test
• Currency
• Relevance
• Authority
• Accuracy
• Purpose
CRAAP Test
Currency
• Related to the timeline (recency) of data
• When was the data created/updated? Is it still valid to use?
• Has the data been updated? Web pages, links, etc.
• Is it important for (your) work to use current data?
• Can old data be used?
Relevance
• Importance of data in relation to your needs
• Who is the intended audience/users of the data?
• Percentage of useful data in the source
• Comparison with other sources – looking at or comparing a variety of
sources to find out which one(s) to use
CRAAP Test
Authority
• Creator’s credentials – who is the author or source of the data?
• Website links – do they provide information about the authors or
sources of the data?
• Collection/analysis methodology used
• Organizational track record, expertise or qualification of authors
Accuracy
• Reliability or correctness of data
• What are the sources the data comes from?
• Cross validation of data against other sources – verifying data from
another source or from personal knowledge
• Is the data supported by evidence, experiment, etc.?
Purpose
• Purpose of data source – the reason the data was created: teaching,
research, information, selling, etc.
• Is the purpose clearly identified?
• Does it provide fact or opinion?
• Creator/organizational bias
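The five CRAAP criteria can be recorded as a simple checklist when comparing candidate sources. The sketch below is only an illustration: the 0 to 5 rating scale, the `CraapScore` class and the `age_in_years` helper are assumptions made for this example and are not part of the CRAAP test itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CraapScore:
    """One rating per CRAAP criterion (the 0-5 scale is an assumption)."""
    currency: int
    relevance: int
    authority: int
    accuracy: int
    purpose: int

    def total(self) -> int:
        # Sum of the five ratings, useful for comparing candidate sources.
        return (self.currency + self.relevance + self.authority
                + self.accuracy + self.purpose)

    def weakest(self) -> str:
        # Name of the lowest-rated criterion (first one wins on ties).
        scores = vars(self)
        return min(scores, key=scores.get)


def age_in_years(last_updated: date, today: date) -> float:
    """Rough age of a dataset, as one input to the Currency rating."""
    return (today - last_updated).days / 365.25


# Hypothetical ratings for one candidate source.
score = CraapScore(currency=2, relevance=4, authority=5, accuracy=4, purpose=3)
print(score.total())    # 18
print(score.weakest())  # currency
print(round(age_in_years(date(2020, 1, 1), date(2024, 1, 1)), 1))  # 4.0
```

A numeric total is only a convenience for ranking sources; the individual questions under each criterion still have to be answered by reading the source.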
Data Types and Characteristics
Discrete vs Continuous
• Discrete data: It can take only certain separate, countable values.
For example, number of people in a room, number of PCs in a lab.
It is not possible to have 2.5 people in a room or 3.5 PCs in a lab.
• Continuous data (or variable): It can take any value within an
interval. Examples include income, sales, age, etc.
Nominal vs Ordinal
• Nominal or categorical or qualitative data: It can only take a finite
set of values, with no meaningful ordering between them. The values
provide descriptions or labels only.
Ex: Gender: Male, Female
• Ordinal data: The order of values is significant.
Examples: Feedback on service
1. not happy
2. happy
3. very happy
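The four data types determine which summaries make sense. A minimal sketch using only the Python standard library follows; all sample values are made up for illustration, and `FEEDBACK_ORDER` and `rank` are names introduced here.

```python
from collections import Counter
from statistics import mean, median_low

# Made-up sample values, for illustration only.
people_per_room = [3, 5, 2, 5]            # discrete: whole-number counts
income = [41250.50, 38900.00, 52310.75]   # continuous: any value in a range
gender = ["Male", "Female", "Female"]     # nominal: labels with no order
feedback = ["happy", "not happy", "very happy", "happy", "very happy"]  # ordinal

# Ordinal data needs an explicit order; ranks start at 1 as on the slide.
FEEDBACK_ORDER = ["not happy", "happy", "very happy"]
rank = {label: i for i, label in enumerate(FEEDBACK_ORDER, start=1)}

# Nominal: only counting (and the most frequent label) is meaningful.
gender_counts = Counter(gender)
print(gender_counts.most_common(1))  # [('Female', 2)]

# Ordinal: the median is meaningful because the values can be sorted.
median_rank = median_low(rank[f] for f in feedback)
median_label = FEEDBACK_ORDER[median_rank - 1]
print(median_label)  # happy

# Continuous: arithmetic summaries such as the mean are meaningful.
print(mean(income))
```

`median_low` is used so that the result is always one of the observed ranks, even when the number of responses is even.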
Course Project
• Group of 2
• Data collected from various sources
• Data normalized to 3rd normal form and stored in SQLite database
• Python for SQL queries
• Data visualization (in Python)
• Data exploration (in Python)
Constraints for this course
1. Primary data source should be https://data.gov.in/
2. Free to select any secondary data sources
Marking scheme
• Novelty of the problem/data collection
• Data selection and cleaning (CRAAP)
• Legal and/or ethical issues
• Structured and semi-structured data
• Data model and implementation (SQL + Python code)
• Data visualization
• Data exploration
• Report and Presentation
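As a rough illustration of storing normalized data in SQLite and querying it from Python, the sketch below uses the standard `sqlite3` module with an in-memory database. The `state`, `district` and `measurement` tables, their columns and all inserted values are hypothetical placeholders, not a schema required by the course and not real data.

```python
import sqlite3

# In-memory database for the sketch; a project would use a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Hypothetical schema: each fact is stored once, and every non-key column
# depends only on the key of its own table.
conn.executescript("""
CREATE TABLE state (
    state_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE district (
    district_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    state_id    INTEGER NOT NULL REFERENCES state(state_id)
);
CREATE TABLE measurement (
    district_id INTEGER NOT NULL REFERENCES district(district_id),
    year        INTEGER NOT NULL,
    value       REAL NOT NULL,
    PRIMARY KEY (district_id, year)
);
""")

# Placeholder rows, made up for illustration.
conn.executemany("INSERT INTO state VALUES (?, ?)",
                 [(1, "State A"), (2, "State B")])
conn.executemany("INSERT INTO district VALUES (?, ?, ?)",
                 [(1, "District 1", 1), (2, "District 2", 1), (3, "District 3", 2)])
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)",
                 [(1, 2023, 10.0), (2, 2023, 20.0), (3, 2023, 30.0)])
conn.commit()

# SQL query from Python: average value per state, joining the three tables.
rows = conn.execute("""
    SELECT s.name, AVG(m.value)
    FROM measurement AS m
    JOIN district AS d ON d.district_id = m.district_id
    JOIN state    AS s ON s.state_id = d.state_id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(rows)  # [('State A', 15.0), ('State B', 30.0)]
```

Keeping the state name in its own table means a correction to that name is made in one row only, which is the practical benefit of normalizing.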
Initial Submission
• One page submission
• Proposal due: Sep 9, 2024
• Identify a novel problem (based on data available at data.gov.in)
• Identify any secondary source of data
• What are the attributes that you will collect?
• Data pre-processing/data exploration/data visualization that you plan
to perform
• The dataset should contain at least 1000 records.
• Project Due: Nov 1, 2024 (No extension)
• Demonstration + Short Viva (Nov 4-8, 2024)
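Before writing the proposal, it may help to check that a candidate dataset has the attributes you plan to collect and meets the 1000-record minimum. The sketch below assumes the data is available as CSV; the `check_dataset` helper, the column names and the generated sample rows are hypothetical.

```python
import csv
import io

MIN_RECORDS = 1000  # minimum dataset size stated for the project


def check_dataset(csv_file, required_attributes):
    """Return (number of records, list of required attributes that are missing)."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [a for a in required_attributes if a not in header]
    n_records = sum(1 for _ in reader)  # counts data rows, not the header
    return n_records, missing


# Generated placeholder CSV standing in for a downloaded file.
sample = io.StringIO(
    "district,year,value\n"
    + "".join(f"D{i},2023,{i}\n" for i in range(1200))
)

n, missing = check_dataset(sample, ["district", "year", "value", "state"])
print(n >= MIN_RECORDS)  # True
print(missing)           # ['state']
```

With a real download, `open(path, newline="")` would replace the `io.StringIO` object.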