Big Data Lecture # 08

The document provides an overview of machine learning (ML), detailing its types, applications, and tools used in big data analytics. It explains various ML paradigms such as supervised, unsupervised, semi-supervised, and reinforcement learning, highlighting their unique characteristics and use cases. Additionally, it covers predictive modeling processes, including model creation, testing, validation, and evaluation, emphasizing the importance of selecting appropriate models for specific problems.

Uploaded by

Ali

BIG DATA ANALYTICS

Lecture 8 --- Week 9


Content

 Overview of Machine Learning

 Task Types in Machine Learning

 Big Data and Machine Learning

 Tools for Machine Learning

 Overview of Predictive Modeling


Overview of Machine Learning

 Machine learning (ML) is concerned with algorithms and techniques that allow
computers to learn.
 The ML approach covers several main domains, such as data mining,
difficult-to-program applications, and software applications.
 It is a collection of a variety of algorithms that can provide multivariate,
nonlinear, nonparametric regression or classification.
 The remarkable simulation capabilities of the ML-based methods have
resulted in their extensive applications in science and engineering.
 Recently, ML techniques have found many applications in astronomy, the
geosciences, and remote sensing.
 More specifically, these techniques have proved practical for cases where
the system’s deterministic model is computationally expensive or there is no
deterministic model to solve the problem.
Task Types in Machine Learning

 The first step in applying ML is teaching the algorithm using a training
dataset.
 The training dataset is a collection of independent variables with the
corresponding dependent variables.
 The machine uses the training data to learn how the independent variables
(input) relate to the dependent variable (output).
 Later, when the algorithm is applied to new input data, it can apply that
relationship and return a prediction.
 After the algorithm is trained, it needs to be tested to get a measure of how
well it can make predictions from new data.
 This requires another dataset with independent and dependent variables, but
the dependent variables are not provided to the learner.
 The algorithm predictions are compared to the withheld data to determine
the quality of the predictions.
 This process requires a dataset that is large enough to be split in two for
training and testing.
 The type of ML method, the size and nature of the training and test datasets,
and the evaluation method should be chosen to optimize the trade-off
between bias and variance to give a meaningful result for the problem at
hand.
 ML algorithms can be classified into many different paradigms, based on the
desired outcome of the algorithm.
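The train-then-test workflow described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data set is a made-up list of (input, output) pairs, the learner is a simple least-squares line, and the names `train_test_split`, `fit_line`, and `mean_squared_error` are hypothetical helpers written for this example, not part of the lecture.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle (x, y) pairs and split them into training and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(train):
    """Learn y = a*x + b from the training pairs by least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mean_squared_error(model, test):
    """Compare predictions with the withheld outputs of the test pairs."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in test) / len(test)

# Hypothetical data set: outputs follow y = 2x + 1 exactly.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
train, test = train_test_split(data)
model = fit_line(train)                       # learn from the training data only
test_error = mean_squared_error(model, test)  # score on data the learner never saw
```

The test outputs are used only inside `mean_squared_error`, never during fitting, which mirrors the idea that the dependent variables of the test set are withheld from the learner.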
Supervised Learning

 Supervised learning is one of the most widely used ML paradigms.
 In supervised learning, the training data you use are already labeled.
 These training data are used to infer a learning algorithm or mapping function
from the input variable (X) to the output variable (Y ).
 The correct answers or desired outputs (labels) are already known, given a
labeled set of input–output pairs, $M = \{(X_i, Y_i)\}_{i=1}^{N}$, where $N$ is
simply the number of training examples.
 The training input $X_i$ is a d-dimensional vector of numbers, also known as
features or attributes.
 The input $X_i$ can be an image, an email message, a time series, a molecular
shape, or a graph.
 The output $Y_i$, also known as a response variable, is a categorical or nominal
variable for a classification problem or a real value for a regression problem.
 Classification algorithms and regression techniques are two types of
supervised learning widely used to develop predictive models.
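As a rough illustration of learning from labeled input–output pairs, the sketch below uses k-nearest neighbours for both a classification and a regression task. The choice of k-nearest neighbours, the toy data, and the function names are assumptions made for this example; the lecture does not prescribe a specific algorithm.

```python
import math
from collections import Counter

def nearest_neighbors(train, x, k):
    """Return the k labeled pairs (X_i, Y_i) whose X_i is closest to x."""
    return sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]

def knn_classify(train, x, k=3):
    """Classification: the output is the majority label among the neighbours."""
    labels = [y for _, y in nearest_neighbors(train, x, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, x, k=3):
    """Regression: the output is the mean value of the neighbours."""
    values = [y for _, y in nearest_neighbors(train, x, k)]
    return sum(values) / len(values)

# Hypothetical labeled data: 2-dimensional feature vectors with a category.
labeled_emails = [
    ((0.0, 1.0), "ham"), ((1.0, 0.0), "ham"), ((1.0, 1.0), "ham"),
    ((8.0, 9.0), "spam"), ((9.0, 8.0), "spam"), ((9.0, 9.0), "spam"),
]
# Hypothetical labeled data: 1-dimensional feature vectors with a real value.
house_prices = [((50.0,), 100.0), ((60.0,), 120.0),
                ((70.0,), 140.0), ((200.0,), 400.0)]

predicted_label = knn_classify(labeled_emails, (8.5, 8.5))
predicted_price = knn_regress(house_prices, (60.0,))
```

The same labeled-pairs structure serves both tasks; only the type of the output (category versus real value) changes.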
Unsupervised Learning

 Unsupervised learning (also known as knowledge discovery) uses unlabeled,
unclassified, and uncategorized training data.
 The main goal of unsupervised learning is to discover hidden and interesting
patterns in unlabeled data.
 Unlike supervised learning, unsupervised learning methods cannot be directly
applied to a regression or a classification problem as one has no idea what the
values for the output might be.
 Clustering is the most common unsupervised learning technique, used in
exploratory data analysis to find hidden patterns or groupings in the data.
 Applications for cluster analysis include gene sequence analysis, market
research and object recognition.
 Common algorithms used in unsupervised learning include clustering, anomaly
detection, neural networks, and approaches for learning latent variable
models.
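To make clustering concrete, here is a minimal k-means sketch. K-means is one common clustering algorithm chosen for illustration; the points, the starting centroids, and the helper names are assumptions for this example only.

```python
import math

def assign(points, centroids):
    """Group each point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[idx].append(p)
    return clusters

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    # Final assignment against the final centroids.
    return centroids, assign(points, centroids)

# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

No labels are supplied anywhere: the grouping emerges from the distances between points alone.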
Semi-supervised Learning

 Semi-supervised learning is a combination of supervised and unsupervised ML
methods.
 Semi-supervised learning algorithms make use of partially labeled training
data – typically a small amount of labeled data with a large amount of
unlabeled data.
 Semi-supervised algorithms are trained on a combination of labeled and
unlabeled data.
 Using the unlabeled data alongside the labeled data can improve learning
accuracy when labels are scarce or expensive to obtain.
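One simple way to combine a few labeled examples with many unlabeled ones is self-training (pseudo-labeling). The sketch below is an assumption-laden illustration, not the lecture's method: it uses a nearest-centroid classifier on one-dimensional toy data, requires at least two classes in the labeled set, and all function names are hypothetical.

```python
def centroids_of(labeled):
    """Mean input value per label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_margin(centroids, x):
    """Nearest-centroid label plus a confidence margin (gap to the runner-up)."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin

def self_train(labeled, unlabeled):
    """Repeatedly pseudo-label the most confident unlabeled point and retrain."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        centroids = centroids_of(labeled)
        x = max(remaining, key=lambda v: predict_with_margin(centroids, v)[1])
        label, _ = predict_with_margin(centroids, x)
        labeled.append((x, label))
        remaining.remove(x)
    return centroids_of(labeled), labeled

# Hypothetical data: two labeled examples and six unlabeled ones.
labeled = [(1.0, "a"), (9.0, "b")]
unlabeled = [2.0, 3.0, 8.0, 7.0, 4.0, 6.0]
final_centroids, all_labeled = self_train(labeled, unlabeled)
```

Each pseudo-labeled point shifts the class centroids, so the unlabeled data refines the decision boundary that the two labeled points alone would give.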
Reinforcement Learning

 Reinforcement learning trains algorithms using a system of reward and penalty
and is closely related to dynamic programming.
 The learning system, called an agent in this context, learns by interacting
with an environment.
 The agent selects and performs actions and receives rewards for performing
correctly and penalties for performing incorrectly.
 In reinforcement learning the agent learns by itself, without human
intervention, the best strategy to maximize reward in a particular situation
using dynamic programming.
 Reinforcement learning differs from unsupervised learning in its goal: while
unsupervised learning looks for hidden patterns in unlabeled data, the goal in
reinforcement learning is to find a suitable action model that would maximize
the total cumulative reward of the agent.
[Figure: the basic idea and elements involved in a reinforcement learning
model.]
 Typical practical applications of reinforcement learning include building
artificial intelligence for playing computer games, robotics and industrial
automation, text summarization engines, dialogue agents (text, speech), etc.
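The reward-and-penalty loop can be sketched with tabular Q-learning, one widely used reinforcement learning algorithm. Everything specific here is an assumption for illustration: a five-state corridor environment, a reward of +1 at the goal, a penalty of -0.1 for every other step, and the parameter values.

```python
import random

ACTIONS = {"left": -1, "right": +1}
GOAL = 4  # states are 0..4; reaching state 4 ends the episode

def step(state, action):
    """Environment: move along the corridor, reward the goal, penalize other steps."""
    next_state = max(0, min(GOAL, state + ACTIONS[action]))
    reward = 1.0 if next_state == GOAL else -0.1
    return next_state, reward, next_state == GOAL

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Agent: learn action values from interaction, with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))                  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted value of the next state.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q_table = q_learning()
# Greedy policy: the action with the highest learned value in each state.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(GOAL)}
```

No one tells the agent that moving right is correct; it discovers this from the rewards and penalties it collects.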
Big Data and Machine Learning
 Big Data and Machine Learning are the blue-chips of the current IT Industry.
 Big data systems store, analyze, and extract information from bulk data sets.
 On the other hand, Machine learning is the ability to automatically learn and improve from
experience without being explicitly programmed.
 Machine Learning provides efficient and automated tools for data gathering, analysis, and
assimilation.
 Combined with the strengths of cloud computing, machine learning brings agility to
processing and integrates large amounts of data regardless of its source.
 Machine learning algorithms can be applied to every element of Big Data operation including:
 Data Segmentation
 Data Analytics
 Simulation
 All these stages are integrated to create the big picture out of Big Data, with insights and
patterns that are later categorized and packaged into an understandable format.
 The fusion of Machine Learning and Big Data is a never-ending loop. The algorithms created
for certain purposes are monitored and perfected over time as information flows into and
out of the system.
Tools for Machine Learning

 Python – This is one of the most dominant languages for data science in the
industry today because of its ease of use, flexibility, and open-source nature.
It has gained rapid popularity and acceptance in the ML community.
 R – It is another very commonly used and respected language in data science.
R has a thriving and incredibly supportive community and it comes with a
plethora of packages and libraries that support most machine learning tasks.
 Apache Spark – Spark was open-sourced by UC Berkeley in 2010 and has since
grown one of the largest communities in big data. It is known as the Swiss
Army knife of big data analytics as it offers multiple advantages such as
flexibility, speed, computational power, etc.
 Jupyter Notebooks – These notebooks are widely used for coding in Python.
While they are predominantly used for Python, they also support other languages
such as Julia and R.
 SAS – It is a very popular and powerful tool. It is commonly used in
the banking and financial sectors. It has a very high share in private organizations
like American Express, JP Morgan, Mu Sigma, Royal Bank of Scotland, etc.
 SPSS – Short for Statistical Package for Social Sciences, SPSS was acquired by IBM
in 2009. It offers advanced statistical analysis, a vast library of machine learning
algorithms, text analysis, and much more.
 Matlab – Matlab is really underrated in the organizational landscape but it is
widely used in academia and research divisions. It has lost a lot of ground in
recent times to the likes of Python, R, and SAS but universities, especially in the
US, still teach a lot of undergraduate courses using Matlab.
 Weka – Weka stands for Waikato Environment for Knowledge Analysis. It is open
source, easy to use, and user-friendly for applying ML algorithms. It has a graphical
user interface and also a command-line interface through which all features of the
software can be used. It is a useful tool when working with
massive datasets where scripting helps in the automation of the work.
Overview of Predictive Modeling

 Predictive modeling is the process of creating, testing and validating a model to best predict
the probability of an outcome.
 A number of modeling methods from machine learning, artificial intelligence, and statistics
are available in predictive analytics software solutions for this task.
 The model is chosen on the basis of testing, validation, and evaluation, using detection
theory to estimate the probability of an outcome given a set amount of input data.
 Models can use one or more classifiers in trying to determine the probability of a set of data
belonging to another set.
 The different models available in the modeling portfolio of predictive analytics software
enable users to derive new information about the data and to develop predictive models.
 Each model has its own strengths and weaknesses and is best suited for particular types of
problems. A model is reusable: it is created by training an algorithm on historical data
and then saved, so that the common business rules it captures can be shared and applied to
similar data. New results can then be analyzed with the trained algorithm, without the
historical data.
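The idea of training once on historical data, saving the model, and reusing it later without that data can be sketched as follows. This is a minimal illustration with assumptions: the model is a least-squares line whose parameters are stored as JSON, and the file name `model.json` and the helper names are hypothetical.

```python
import json
import os
import tempfile

def train(xs, ys):
    """Fit y = slope * x + intercept on historical data by least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def predict(model, x):
    """Apply a trained (or reloaded) model to new input."""
    return model["slope"] * x + model["intercept"]

# Hypothetical historical data.
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.json")
    with open(path, "w") as handle:
        json.dump(model, handle)       # save the trained model for reuse
    with open(path) as handle:
        reloaded = json.load(handle)   # later: load it without the training data

prediction = predict(reloaded, 10.0)
```

Only the learned parameters are stored, so whoever reuses the model needs neither the historical records nor the training step.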
Business process on Predictive Modeling

 Creating the model: Software solutions allow you to create a model to run
one or more algorithms on the data set.
 Testing the model: Test the model on the data set. In some scenarios, the
testing is done on past data to see how best the model predicts.
 Validating the model : Validate the model run results using visualization tools
and business data understanding.
 Evaluating the model: Evaluate the candidate models and choose the one
that best fits the data.
Predictive modeling process

 The process involves running one or more algorithms on the data set on which
prediction is going to be carried out.
 This is an iterative process and often involves training the model, using
multiple models on the same data set, and finally arriving at the best-fit
model based on the business data understanding.
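Trying several models on the same data set and keeping the best fit can be sketched as below. The three candidate models, the toy data, and the use of mean squared error on a held-out validation set are assumptions chosen for illustration; in practice the choice would also draw on the business understanding of the data.

```python
def fit_mean(train):
    """Baseline model: always predict the average training output."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_linear(train):
    """Least-squares line through the training pairs."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_nearest(train):
    """Predict the output of the closest training input."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def mse(predict, rows):
    """Mean squared error of a model's predictions on (x, y) rows."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical historical data following y = 3x + 2.
history = [(float(x), 3.0 * x + 2.0) for x in range(12)]
train, validation = history[0::2], history[1::2]

candidates = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
scores = {name: mse(fit(train), validation) for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)  # lowest validation error wins
```

Every candidate is trained on the same training rows and scored on the same validation rows, so the comparison reflects the models rather than the data they happened to see.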
Models Category

 Predictive models: These models analyze past performance to make future
predictions.
 Descriptive models: These models quantify the relationships in data in a
way that is often used to classify data sets into groups.
 Decision models: The decision models describe the relationship between all
the elements of a decision in order to predict the results of decisions
involving many variables.
Features in Predictive Modeling

 Data analysis and manipulation: Tools to analyze data and to create, modify,
combine, categorize, merge, and filter data sets.
 Visualization: Visualization features include interactive graphics and reports.
 Statistics: Statistics tools to create and confirm the relationships between
variables in the data. Statistics from different statistical software can be
integrated into some of the solutions.
 Hypothesis testing: Creation of models, evaluation, and choosing of the right
model.
