Week 1 Data Discovery

The document outlines the first module of a data science journey, focusing on discovering data sets, including their sources, types, and values. It highlights three main sources of data (public, private, and personal), as well as the distinctions between structured, unstructured, and semi-structured data. Additionally, it emphasizes the importance of understanding data types and values for effective data analysis.

Uploaded by

revanthkalla1
Copyright
© All Rights Reserved

Data discovery

Discover the Data - Introduction (Video 1)


Video summary
This video introduces the first module of a data science journey, focusing on discovering data
sets. It covers three main topics: sources of data sets, types of data sets, and the kinds of
values found in data sets.

Highlights:

• 00:00 Introduction to data science journey
o Importance of data sets
o Overview of the first module
o Discovering data sets
• 00:21 Sources of data sets
o Where to find data sets
o Internet, organizations, and personal devices
o Importance of knowing data sources
• 00:36 Types of data sets
o Structured data
o Unstructured data
o Semi-structured data
• 00:45 Kinds of values in data sets
o Different values in data sets
o Importance of understanding data values
o Helps in locating desired data sets

Before we begin the data science journey, you first need the data set. And to get the data set, you
need to know where it is. This is what we will be covering in the first module.

How do you discover data? There are three things that you will learn in this module.

• The first is, what are the different sources of data sets? Where can you find them?
• The second is, what are the different kinds of data sets? Structured, unstructured and semi-
structured.
• Third, in each data set, what are the different kinds of values that you will find?
This will help you locate the kind of data set that you want, whether on the internet, within your
organization, or even on your phone.

Sources of Data (Video 2)


Here are the key points from the video “Discover the Data - Introduction”:

• Data Sources: Learn where to find datasets, including online, within organizations, and on
personal devices.
• Types of Data: Understand the differences between structured, unstructured, and semi-
structured data.
• Data Values: Identify the various kinds of values within datasets to locate the data you
need.
The video covers three main sources of data: public, private, and personal.
1. Public Data [00:00:16]
o Open and free to access online
o Examples include Google Finance API and government data portals
2. Private Data [00:01:11]
o Accessible to a limited audience, often within organizations
o Can be purchased, such as data from Dun & Bradstreet
3. Personal Data [00:01:29]
o Data from personal devices and activities
o Examples include call logs, music listening history, and health app data

Video summary

This video explains the three main sources of data: public, private, and personal. It discusses
the characteristics, accessibility, and examples of each type, providing insights on how to
locate and utilize these data sets effectively.

Highlights:

• 00:14 Types of data sources
o Public, private, and personal data
o Public data is open and free
o Private data is accessible to a few people
• 01:01 Challenges with public data
o Hard to find specific data
o Data may not be in the desired format
o Tips for searching public data
• 02:31 Examples of public data sources
o Awesome public data sets catalog
o Google data set search
o Kaggle and government websites
• 12:00 Private data sources
o Found within organizations
o Examples include employee lists and financial details
o Often sensitive or proprietary
• 14:19 Personal data sources
o Data from personal devices and apps
o Examples include call logs and health tracking
o Unique to each individual

The video also provides tips on finding and using these data sources effectively.

• Awesome public datasets
• Google dataset search
• Kaggle datasets
• Data.gov and Data.gov.in
• Datameet

Types of datasets (Video 3)

Video summary

This video explains the different types of datasets, ranging from structured to unstructured,
and provides examples of each type. It also discusses the concept of semi-structured data and
how unstructured data can be converted into structured data for easier analysis.

Highlights:

• 00:14 Types of data
o Structured data has a defined schema
o Unstructured data lacks a predefined structure
o Semi-structured data lies between structured and unstructured
• 01:26 Structured data examples
o Database tables with defined fields and types
o Spreadsheets with specific columns and values
o Shapefiles containing geographic data
• 05:56 Semi-structured data examples
o Documents like PDFs and HTML files
o Web pages with mixed structured and unstructured information
o Emails with both structured headers and unstructured content
• 07:47 Unstructured data examples
o Text, images, audio, and video files
o Challenges in analyzing unstructured data
o Techniques in deep learning to extract structured information from unstructured data
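
To make the email example above concrete, here is a minimal sketch (not from the video) that separates the structured headers of a message from its unstructured body using Python's standard `email` module. The sample message and the helper name `split_email` are made up for illustration.

```python
from email import message_from_string

# Hypothetical sample message, made up for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers

Hi Bob,
Revenue was up this quarter. Details to follow.
"""

def split_email(raw):
    """Return (headers as a dict, body as free text) for a simple email."""
    msg = message_from_string(raw)
    # Headers are name/value pairs: the structured part.
    headers = {name: value for name, value in msg.items()}
    # The body of a non-multipart message is a plain string: the unstructured part.
    body = msg.get_payload()
    return headers, body

headers, body = split_email(RAW_EMAIL)
print(headers["Subject"])
print(body.strip())
```

The headers can be queried by name like columns in a table, while the body still needs further processing before any fields can be pulled out of it.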

• Structured data has a schema: Databases, Spreadsheets, Forms, Shapefiles
• Semi-structured data has a flexible schema: JSON, HTML, Email
• Unstructured data has no schema: Text, Images, Audio, Video
• DBF opener
• MapShaper lets you view Shapefiles

Types of values (Video 4)


Video summary [00:00:00] - [00:07:09]:

This video explains the different types of values found in data sets, including categorical,
numerical, and composite values. It discusses their characteristics and the operations that can
be performed on them.

Highlights:

• [00:00:00] Introduction to types of values
o Categorical, numerical, and composite values
o Characteristics of each type
o Operations that can be performed
• [00:00:28] Categorical values
o Fewer computations possible
o Examples: colors, cities
o Types: boolean, unordered, ordered, cyclical, unstructured
• [00:04:27] Numerical values
o Real numbers, integers, fractions, decimals
o Operations: addition, multiplication, ratios
o Examples: -2, 1.5, pi
• [00:05:00] Composite values
o Combine multiple elements
o Examples: date, time, spatial structures
o Support more operations than numerical types
• [00:06:28] Specialized composite structures
o Examples: IP addresses, currencies
o Collections of values
o Support extensive operations

• Categorical values may be:
o Boolean: True or False
o Unordered: No order, like colors
o Ordered: Order, like ratings
o Cyclical: Like days of the week
o Unstructured: Like names, images
• Numerical values may be:
o Integer: You can add or subtract
o Real: You can multiply or divide
• Composite values have an internal structure
o Temporal: Date, Time
o Spatial: Latitude, Longitude, Shapefiles
o Structured: JSON, XML with schema
o Specialized: IP addresses, URLs, Email addresses, Phone numbers, etc.
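
The operations each kind of value supports can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the rating scale and the helper names are hypothetical, and only the standard library is used.

```python
from datetime import date
from ipaddress import ip_address

# Ordered categorical: comparisons make sense, arithmetic does not.
RATINGS = ["poor", "fair", "good", "excellent"]  # hypothetical scale

def rating_at_least(rating, threshold):
    return RATINGS.index(rating) >= RATINGS.index(threshold)

# Cyclical categorical: the value after the last one wraps to the first.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def day_after(day, offset=1):
    return DAYS[(DAYS.index(day) + offset) % len(DAYS)]

# Composite (temporal): a date has internal parts and supports date arithmetic.
def days_between(start, end):
    return (end - start).days

# Composite (specialized): an IP address has internal structure as well.
def is_private_ip(text):
    return ip_address(text).is_private

print(rating_at_least("good", "fair"))
print(day_after("Sun"))
print(days_between(date(2024, 1, 1), date(2024, 3, 1)))
print(is_private_ip("192.168.1.10"))
```

Ratings can be compared but not added, days of the week wrap around, and dates and IP addresses expose operations that come from their internal structure.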

Week Summary (Video 5)


Video summary

This video provides a summary of the skills you should have acquired from the module,
focusing on finding and understanding data types. These skills are essential for effective data
analysis and gaining a competitive edge.

Highlights:

• 00:00 Introduction to skills
o Finding data
o Understanding data types
o Importance of these skills
• 00:17 Finding data
o More data leads to more analysis
o Discovering new data sources
o Competitive advantage
• 00:38 Understanding data types
o Structured data is easier to work with
o Numerical data is simpler than categorical
o Less effort needed for structured data
• 01:04 Comparing data sets
o Evaluating data sets
o Choosing the right data set
o Maximizing results with less effort

Based on what you have learnt in this module, you should be able to do two things: find data and
understand what type of data it is.

Both of these are powerful skills.

The more data you can find, the more analysis you can do that others cannot. Discovering new
sources of data is therefore a competitive advantage and a skill that is well worth building.

Understanding the type of data gives you an edge in knowing which data set is easier to work
with. Structured data is easier to work with because you don't have to do any additional work to
extract information from it. Numerical values are easier to work with than, say, categorical or
composite values because there's less effort to extract the structure. So you'll be able to compare
two data sets and say which one gets more results for less time and effort.

Sample questions
• Find the UCI machine learning dataset on Wine Quality. (It has 4,898 rows.) What is the
highest pH value of the red wines? (ANS: 4.01)
• What's the official data portal of Russia? (ANS: https://data.gov.ru/?language=en)
• Are research papers structured, semi-structured or unstructured? (ANS: Semi-structured.
They have author names, abstracts, keywords, etc. but most content is free-form.)
• Are book titles categorical or composite? (ANS: Categorical. They don't have an underlying
structure.)
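
For the first sample question, one possible way to check the answer once the file has been downloaded is sketched below. It assumes the red wine file is semicolon-delimited with a header row containing a `pH` column; the helper name `max_column` and the three-row sample are made up for illustration and are not the real data.

```python
import csv
import io

def max_column(csv_text, column, delimiter=";"):
    """Return the largest numeric value found in one column of delimited text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return max(float(row[column]) for row in reader)

# Tiny made-up sample in the assumed layout; not the real data set.
SAMPLE = '"fixed acidity";"pH";"quality"\n7.4;3.51;5\n7.8;3.2;5\n11.2;3.16;6\n'
print(max_column(SAMPLE, "pH"))
```

To run it on the real data, pass the text of the downloaded file instead of `SAMPLE`, for example the contents of a local copy saved under a name of your choosing.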
