Data Science
Liana Harutyunyan
Programming for Data Science
May 1, 2024
American University of Armenia
[email protected]
1
Data Science
• Data science is an interdisciplinary field that uses
algorithms, procedures, and processes to examine large
amounts of data in order to uncover hidden patterns,
generate insights, and direct decision making.
2
Data Science
• Data science is an interdisciplinary field that uses
algorithms, procedures, and processes to examine large
amounts of data in order to uncover hidden patterns,
generate insights, and direct decision making.
• It includes but is not limited to this stages:
• Data collection, cleaning and preprocessing
• Exploratory data analysis
• Machine learning (finding insights)
• Communication and interpretation of results
2
Programming for Data Science
Main programming languages used for Data Science are
• Python - which is more general-purpose programming
language
• R - which is primarily used for statistical analysis and
data visualization
3
Programming for Data Science
Main programming languages used for Data Science are
• Python - which is more general-purpose programming
language
• R - which is primarily used for statistical analysis and
data visualization
However, both of them, develop with high speed, and
sometimes it just comes to personal preference.
3
Data Science and Machine Learning
Machine Learning is a part of Data Science.
Deep Learning is a part of Machine Learning.
4
Main directions
Machine Learning has 3 main techniques to learn something
(algorithms also divide into these categories):
• Supervised Learning - when you have labeled dataset,
both inputs and outputs
• regression
• classification
5
Main directions
Machine Learning has 3 main techniques to learn something
(algorithms also divide into these categories):
• Supervised Learning - when you have labeled dataset,
both inputs and outputs
• regression
• classification
• Unsupervised Learning - unlabeled dataset, only input,
we provide the machine with data and ask to look for
hidden features
5
Main directions
Machine Learning has 3 main techniques to learn something
(algorithms also divide into these categories):
• Supervised Learning - when you have labeled dataset,
both inputs and outputs
• regression
• classification
• Unsupervised Learning - unlabeled dataset, only input,
we provide the machine with data and ask to look for
hidden features
• Reinforcement Learning - machine learns getting
positive and negative rewards from the environment
(similar to how the child learns)
5
Linear Regression example
Imagine you have a dataset of Flat area and price, and you
want to predict the price of the apartment based on the
area.
Let’s do the same exercise in both Python and R.
• Build a toy dataset of area-price relationship.
• Use built-in libraries and functions to predict the
relationship.
6
Classical Models
• In Python, the most classical Machine Learning
algorithms are implemented in sklearn package.
• In R, some algorithms are in the base library, others
have their individual packages. Currently unified
package is being developed.
7
Deep Learning Models
To write Deep Learning Models (Neural Nets) in Python, you
need either of the following libraries:
• PyTorch
• Tensorflow
They require more code than the classical models.
(For R, libraries for deep learning are being developed, but
not highly used.)
8
Other terms
Other terms used in Data Science:
• Web Scraping - Download data from Internet that are
not already available for download (tables)
• APIs - Download data using APIs (request - reply of data)