
Mds101 Unit 1

Introduction to data science

Uploaded by

Srinivasa Rao T
Copyright
© All Rights Reserved

MDS101 – INTRODUCTION TO DATA SCIENCE

Total Teaching Hours: 52 No. of Hours / Week: 04

Course Objective:

• To understand the applications of Data Science
• To provide in-depth knowledge of the principles, techniques and applications of Data Science
• To gain a well-rounded introduction to the core concepts and technologies of Data Science
• To gain an insight into data-driven programming
Learning Outcome:
Upon completion of the course, students will be able to
• Explore data science and data engineering
• Apply Data-Driven Insights to Business and Industry
• Create Data Visualizations That Clearly Communicate Meaning
• Build Models That Operate Internet-of-Things Devices
• Apply Domain Expertise to Solve Real-World Problems Using Data Science

UNIT-I [12 Hours]

Getting Started with Data Science – Facets of Data: structured data, unstructured data, natural
language, machine-generated data, graph-based or network data, audio, image, video streaming
data. Data Science Process: setting the research goal, retrieving data, data preparation, data
exploration, data modelling or model building, presentation and automation. Who Can Make Use
of Data Science, Analyzing the Pieces of the Data Science Puzzle, Exploring the Data Science
Solution Alternatives, Letting Data Science Make You More Marketable.
UNIT-II [10 Hours]
Exploring Data Engineering - Pipelines and Infrastructure - Grasping the Difference between Data
Science and Data Engineering - Identifying Big Data Sources - Making Sense of Data in Hadoop -
Identifying Alternative Big Data Solutions. Applying Data-Driven Insights to Business - Defining
Business-Centric Data Science - Differentiating between Business and Data-Driven Business -
Benefiting from Business-Centric Data Science - Converting Raw Data into Actionable Insights with
Data Analytics - Taking Action on Business Insights - Distinguishing between Business Intelligence
and Data Science.
UNIT-III [10 Hours]
Using Data Science to Extract Meaning from Your Data: Machine Learning - Learning from
Data with Your Machine, Defining Machine Learning and Its Processes, Considering Learning
Styles. Building Models That Operate Internet-of-Things Devices - Overviewing the Vocabulary
and Technologies, Digging into the Data Science Approaches, Advancing Artificial Intelligence
Innovation.
UNIT-IV [10 Hours]
Creating Data Visualizations: Following the Principles of Data Visualization Design, Data
Visualizations - The Big Three, Designing to Meet the Needs of Your Target Audience, Picking the
Most Appropriate Design Style, Choosing How to Add Context, Selecting the Appropriate Data
Graphic Type, Choosing a Data Graphic. Using D3.js for Data Visualization - Introducing the
D3.js Library - Knowing When to Use D3.js - Getting Started in D3.js - Implementing More Advanced
Concepts.
UNIT-V [10 Hours]
Doing Data Science with Excel and KNIME - Making Life Easier with Excel, Using KNIME for
Advanced Data Analytics. Applying Domain Expertise to Solve Real-World Problems Using
Data Science - Data Science for driving growth in e-commerce.
Textbooks and References:

1. Introducing Data Science by Davy Cielen, Arno D. B. Meysman and Mohamed Ali, Dreamtech Press
2. Data Science For Dummies, 2nd Edition, by Lillian Pierson
3. An Introduction to Data Science by Jeffrey S. Saltz and Jeffrey M. Stanton
4. A Hands-On Introduction to Data Science by Chirag Shah

Getting Started with Data Science – Facets of Data

Data science focuses on making sense of complex datasets and on building
predictive models from that data. As such, it encompasses a wide array of
activities, from the upstream processes of acquiring, cleaning and
integrating data to the downstream processes of analysis, modeling and
prediction. There are many facets of data science, including:
• Identifying the structure of data
• Cleaning, filtering, reorganizing, augmenting, and aggregating data
• Visualizing data
• Data analysis, statistics, and modeling
• Machine learning
• Assembling data processing pipelines to link these steps
• Leveraging high-end computational resources for large-scale problems


Often, different tools address different parts of this process.
Therefore, interoperability among tools, based on common data structures
and interfaces, is an important element in enabling the construction of
complex, multifaceted data analysis pipelines. It is in this sense that we can
talk about an ecosystem for data science. For any particular application, you
might only be interested in a

Structured Data
Structured data is data that conforms to a data model, has a well-defined structure,
follows a consistent order, and can be easily accessed and used by a person or a computer
program.
Structured data is usually stored under a well-defined schema, for example in a database. It is
generally tabular, with rows and columns that clearly define its attributes.
SQL (Structured Query Language) is often used to manage structured data stored in
databases.
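As a small illustration of structured data managed with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and rows are hypothetical and chosen only for this example.

```python
import sqlite3

# Hypothetical schema and rows, used only to illustrate structured data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, marks INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO students (id, name, marks) VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Ravi", 67), (3, "Meena", 91)],
)

# Every row has the same fixed fields, so a query can address them by name.
top_students = conn.execute(
    "SELECT name, marks FROM students WHERE marks >= 80 ORDER BY marks DESC"
).fetchall()
conn.close()

print(top_students)
```

Because the schema fixes the fields of every record in advance, the query can filter and sort on the marks column without inspecting the rows one by one in application code.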
Characteristics of Structured Data:
• Data conforms to a data model and has an easily identifiable structure
• Data is stored in rows and columns, as in a database table
• Data is well organised, so its definition, format and meaning are explicitly known
• Data resides in fixed fields within a record or file
• Similar entities are grouped together to form relations or classes
• Entities in the same group have the same attributes
• Data is easy to access and query, so it can be readily used by other programs
• Data elements are addressable, so they are efficient to analyse and process
Sources of Structured Data:
• SQL databases
• Spreadsheets such as Excel
• OLTP systems
• Online forms
• Sensors such as GPS or RFID tags
• Network and web server logs
• Medical devices
Advantages of Structured Data:
• Structured data has a well-defined structure, which makes it easy to store and access
• Data can be indexed on text strings as well as on attributes, which makes searching
straightforward
• Data mining is easy, i.e. knowledge can be readily extracted from the data
• Operations such as updating and deleting are easy because of the well-structured form of the data
• Business Intelligence operations such as data warehousing can be easily undertaken
• It scales easily when the volume of data grows
• Securing the data is easy
Note: Structured data accounts for only about 20% of all data, but its high degree
of organisation and performance makes it the foundation of Big Data.
Structured Data
When we talk about structured data, we are often talking about tabular (rectangular) data, i.e.
rows and columns from a database. These tables mainly contain two types of structured data:

1. Numerical Data
Data that is expressed on a numerical scale. It takes two forms:
• Continuous: data that can take any value in an interval. For example, the speed of a
car, heart rate, etc.
• Discrete: data that can take only integer values, such as counts. For example, the
number of heads in 20 flips of a coin.
2. Categorical Data
Data that can take only a specific set of values representing possible categories. Such data is also
called enums, enumerated, factors, or nominal.
• Binary: a special case of categorical data where the features are dichotomous, i.e. can take
only 0/1 or True/False.
• Ordinal: categorical data that has an explicit ordering. For example, the five-star rating of a
restaurant (1, 2, 3, 4, 5).
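The four kinds of data described above can be sketched in a pandas dataframe. This is a minimal illustration, assuming pandas is installed; the column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical columns, one per data type discussed above.
df = pd.DataFrame(
    {
        "speed_kmh": [62.5, 48.0, 71.3],    # numerical, continuous
        "heads_in_20_flips": [9, 12, 10],   # numerical, discrete (a count)
        "is_member": [True, False, True],   # categorical, binary
        "star_rating": [3, 5, 4],           # categorical, ordinal
    }
)

# Declare the rating as an ordered category so that 1 < 2 < ... < 5 is explicit.
rating_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
df["star_rating"] = df["star_rating"].astype(rating_type)

print(df.dtypes)

# Comparisons such as max() are meaningful only because the category is ordered.
highest = df["star_rating"].max()
```

Marking the rating column as an ordered category records the ordering in the data itself, so later steps do not have to guess whether the numbers are counts or ranks.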
Why do you need to learn about the types of data? Without knowing the type of data you are
working with, you cannot choose the right statistical methods to deal with it.
For example, if one of the columns in a dataframe holds ordinal data, it has to be preprocessed;
in Python, the scikit-learn package offers an OrdinalEncoder for this purpose.
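A minimal sketch of that preprocessing step follows, assuming pandas and scikit-learn are installed. The column name and its three levels are hypothetical; the category order is passed explicitly so that the encoder does not fall back to alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column with the ordering small < medium < large.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# One list of categories per encoded column, given in the intended order.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])

# fit_transform expects 2-D input and returns a 2-D float array.
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

print(df)
```

Each level is replaced by its position in the supplied list, so "small" becomes 0.0, "medium" 1.0 and "large" 2.0.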
The next step is to dive deeper into structured data and into how third-party packages and
libraries can be used to manipulate such structures. There are two main types of structures, or
data storage models:
1. Rectangular
2. Non-Rectangular

Rectangular Data
Most analyses in data science are done with a rectangular, two-dimensional data object such as a
dataframe, spreadsheet, CSV file, or database table.
Such an object consists of rows that represent records (observations) and columns that represent
features (variables).
A dataframe, in particular, is a tabular data structure that supports efficient operations for
manipulating the data.
Dataframes are the most commonly used data structures, so it is important to cover a few definitions
here:
Data frame
Rectangular data structure (like a spreadsheet) for efficient manipulation and application of statistical
and machine learning models.
Feature
A column within a dataframe is commonly referred to as a feature.
Synonyms: attribute, input, predictor, variable
Outcome
Many data science projects involve predicting an outcome, often a yes/no outcome.
Synonyms: dependent variable, response, target, output
Records
A row within a dataframe is commonly referred to as a record.
Synonyms: case, example, instance, observation, pattern, sample
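The definitions above can be tied to a concrete dataframe. The sketch below assumes pandas is installed and uses an invented loan dataset; the column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical data: each row is a record, each column a feature or the outcome.
loans = pd.DataFrame(
    {
        "income": [42000, 58000, 31000, 76000],
        "loan_amount": [10000, 15000, 12000, 20000],
        "defaulted": ["no", "no", "yes", "no"],
    }
)

features = loans[["income", "loan_amount"]]  # predictor columns
outcome = loans["defaulted"]                 # the yes/no outcome to predict

n_records, n_features = features.shape
first_record = loans.iloc[0]                 # one row, i.e. one observation

print(n_records, n_features)
print(first_record)
```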
Relational database tables have one or more columns designated as an index, essentially a row
number. This can vastly improve the efficiency of certain database queries. In a pandas dataframe,
an automatic integer index is created based on the order of the rows. In pandas, it is also possible to
set multilevel/hierarchical indexes to improve the efficiency of certain operations.
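Both kinds of index can be seen in a short sketch, assuming pandas is installed. The sales table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical table; pandas assigns the integer index 0..3 automatically.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "year": [2021, 2022, 2021, 2022],
        "revenue": [120, 135, 90, 110],
    }
)
default_index = list(sales.index)

# A two-level (hierarchical) index keyed by region, then year.
indexed = sales.set_index(["region", "year"]).sort_index()

# Rows are now looked up by key rather than by position.
south_2022 = indexed.loc[("south", 2022), "revenue"]

print(default_index)
print(indexed)
print(south_2022)
```

Sorting the multilevel index after setting it is a common habit, since label-based lookups on a sorted index are generally faster.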

Non-rectangular Data
Besides rectangular data, we have several other data structures which come under the umbrella of
non-rectangular data.
Spatial data structures, which are used in geolocation analytics, are more complex than rectangular
data structures and differ from them. In the object representation, the focus of the data is an object
(e.g., a park) and its spatial coordinates. The field view, by contrast, focuses on small units of space
and the value of a relevant metric (pixel intensities, for example).
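The two spatial views can be contrasted in plain Python. This is only a sketch: the class name, the park coordinates and the grid values are invented for illustration, and real geospatial work would normally rely on a dedicated library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpatialObject:
    """Object representation: a named thing plus its spatial coordinates."""
    name: str
    boundary: List[Tuple[float, float]]  # corner points as (x, y) pairs


# Hypothetical park described by the corners of its outline.
park = SpatialObject("park", [(0.0, 0.0), (0.0, 2.0), (3.0, 2.0), (3.0, 0.0)])

# Field view: space is cut into small cells, each holding a metric value
# (here, made-up pixel intensities on a 3 x 3 grid).
field = [
    [0, 0, 12],
    [0, 40, 55],
    [8, 60, 90],
]

brightest = max(max(row) for row in field)

print(park.name, park.boundary)
print(brightest)
```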
Graph data structures are used to represent relationships: physical, social, and abstract. For
example, Facebook or Twitter represents connections between people on the network as a graph of
social relationships. Graph structures are useful for certain types of problems, such as network
optimization and recommender systems.
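A social graph of this kind can be sketched as an adjacency structure in plain Python. The names and the suggest_friends helper are hypothetical; the helper ranks non-friends by the number of mutual friends, which is one very simple recommendation rule and not the method used by any particular network.

```python
# Hypothetical undirected friendship graph stored as an adjacency mapping.
friends = {
    "ana": {"ben", "cara"},
    "ben": {"ana", "dev"},
    "cara": {"ana", "dev"},
    "dev": {"ben", "cara"},
}


def suggest_friends(graph, person):
    """Rank people who are not yet friends of `person` by mutual-friend count."""
    counts = {}
    for friend in graph[person]:
        for candidate in graph[friend]:
            if candidate != person and candidate not in graph[person]:
                counts[candidate] = counts.get(candidate, 0) + 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


suggestions = suggest_friends(friends, "ana")
print(suggestions)
```

Here "dev" is suggested to "ana" with a count of 2, because both of her friends are connected to him.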
Each of these data types has its own specialized set of methods in data science. The focus of this
series is on rectangular data, which forms the fundamental building block of predictive modeling.
