
Financial Machine Learning - Unit 1
Dr. J.Dhanalakshmi
Objectives

● Basic concepts and techniques of machine learning.
● To understand supervised and unsupervised learning techniques.
● To study the various probability-based learning techniques.
● To understand graphical models of machine learning algorithms.


Introduction

● ML is an application of AI that provides systems the ability to automatically learn and improve from past experience and to make predictions without being explicitly programmed.
Traditional vs Machine Learning

A traditional program takes data and rules as input; the rules are applied to the input data to produce the output.
In machine learning, the data and the desired output are given as input, and ML comes up with the rules, patterns, or models that it finds in the input data.
Need for Machine Learning

● Rapid increase in the production of data.
● Solving complex problems that are difficult for a human.
● Decision making in various sectors, including finance.
● Finding hidden patterns and extracting useful information from data.
Relation of Artificial Intelligence, Machine Learning and Deep Learning

● Artificial Intelligence: the broad discipline of creating intelligent machines.
● Machine Learning: systems that can learn from past experience.
● Deep Learning: systems that learn from experience on large data sets, using multi-layered networks for ML.
Types of Learning
Supervised Learning

● The model is trained on a labelled dataset.
● A labelled dataset is one that has both input and output parameters.
● Classification: the output has defined labels (discrete values).
● Regression: the output has continuous values.
Unsupervised Learning
● The machine learns without any supervision.
● The goal is to restructure the input data into new features or groups of objects with similar patterns.
● Clustering: a method of grouping entities based on similarities.
● Dimension Reduction: a method of reducing the number of variables in a training dataset used to develop machine learning models (a minimal sketch follows below).
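As an illustration of dimension reduction, here is a minimal sketch using scikit-learn's PCA; the random dataset and the choice of two components are assumptions for demonstration only.

# A minimal dimension-reduction sketch using scikit-learn's PCA.
# The random data and n_components=2 are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)            # 100 samples, 5 features
pca = PCA(n_components=2)             # reduce to 2 components
X_reduced = pca.fit_transform(X)      # project onto the top 2 components
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance explained per component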
Semi-supervised Learning
● The algorithm is trained on a combination of labeled and unlabeled data.
● Typically there is a small amount of labeled data and a very large amount of unlabeled data.
Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action.
Successful Applications of Machine Learning

● Learning to recognize spoken words

● Learning to drive an autonomous vehicle

● Learning to classify new astronomical structures


Regression

A model is simply a mapping from the features (inputs) to the labels (outputs).

1. Linear Regression

2. Logistic Regression
Linear Regression

● Used to estimate real values based on continuous variables.
● Models the relationship between an input variable (x) and an output variable (y) by fitting the best line.
Linear Regression
● Linear regression can be expressed mathematically as:
y = β0 + β1x + ε
y = dependent variable
x = independent variable
β0 = intercept of the line
β1 = linear regression coefficient (slope of the line)
ε = random error
● Polynomial model:
Y = b + w1*x1 + w2*x2 + ... + wm*xm
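To make the equation concrete, here is a minimal sketch of fitting y = β0 + β1x with scikit-learn; the toy data points are assumptions for illustration only.

# A minimal linear regression sketch: estimate β0 (intercept) and
# β1 (slope) from toy data. The data values are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])    # independent variable x
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # dependent variable y

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)   # estimated β0 and β1
print(model.predict([[6]]))            # prediction for x = 6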
Logistic Regression

● Predicts a categorical dependent variable from a given set of independent variables.
● Output: a categorical or discrete value, e.g. Yes or No, 0 or 1, True or False.
● In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, whose output is bounded between the two extreme values 0 and 1.
● The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
Logistic Regression

Model:

Output = 0 or 1
hΘ(x) = sigmoid(z), where z = Θᵀx
g(z) = 1 / (1 + e^(-z))
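A minimal sketch of the sigmoid and a fitted logistic classifier follows; the toy pass/fail data is an assumption for illustration.

# A minimal logistic regression sketch. The toy data (hours studied
# vs. pass/fail) is purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # g(z) = 1 / (1 + e^(-z))

X = np.array([[1], [2], [3], [4], [5], [6]])  # hours studied
y = np.array([0, 0, 0, 1, 1, 1])              # 0 = fail, 1 = pass

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.5]]))          # predicted class (0 or 1)
print(clf.predict_proba([[3.5]]))    # probabilities of each class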
Naive Bayes Classifier

• The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V.
• A set of training examples is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2, ..., an).
• The learner is asked to predict the target value (classification) for this new instance.
Naive Bayes Classifier

● Naive: assumes that the occurrence of a certain feature is independent of the occurrence of other features.
● Bayes: it is called Bayes because it depends on the principle of Bayes' theorem.
● Bayes' Theorem: also known as Bayes' rule or Bayes' law, it is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
Naive Bayes
● The formula for Bayes' theorem is:

P(A|B) = P(B|A) · P(A) / P(B)

● P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
● P(B|A) is the likelihood: the probability of the evidence given that hypothesis A is true.
● P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
● P(B) is the marginal probability: the probability of the evidence.
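As a worked example of the formula, consider a spam filter; all the probabilities below are assumed numbers for illustration.

# Worked Bayes' theorem example with assumed numbers:
# P(spam) = 0.2, P("free" | spam) = 0.6, P("free" | not spam) = 0.05
p_spam = 0.2
p_free_given_spam = 0.6
p_free_given_ham = 0.05

# Marginal probability of the evidence, P("free")
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))   # 0.75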
Advantages of Naïve Bayes Classifier:

● Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
● It can be used for Binary as well as Multi-class Classifications.
● It performs well in Multi-class predictions as compared to the other
Algorithms.
● It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
● Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.

Applications
● It is used for Credit Scoring.

● It is used in medical data classification.

● It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.

● It is used in Text classification such as Spam filtering and Sentiment analysis.
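A minimal spam-filtering sketch with scikit-learn's multinomial Naive Bayes follows; the four-message corpus and its labels are assumptions for illustration.

# A minimal Naive Bayes text-classification sketch.
# The tiny corpus and labels are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win free money now", "meeting at noon today",
         "free prize claim now", "lunch with the team"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # word-count features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize"])))  # [1]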


SVM

● The goal is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that new data points can easily be placed in the correct category in the future.
● Used for classification or regression.
● SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Hyperplane and Support Vectors in the SVM algorithm

● Hyperplane: hyperplanes are decision boundaries that help classify the data points.
● Data points falling on either side of the hyperplane can be attributed to different classes.
● The data points or vectors that are closest to the hyperplane and which affect its position are termed support vectors; since these vectors "support" the hyperplane, they give the method its name.
Types of SVM

SVM can be of two types:

● Linear SVM: used for linearly separable data, meaning the dataset can be classified into two classes using a single straight line.
● Non-linear SVM: used for non-linearly separable data, meaning the dataset cannot be classified using a straight line.
Issues in SVM
● SVM algorithm is not suitable for large data sets.

● SVM does not perform very well when the data set has more noise, i.e. the target classes are overlapping.


Linearly Separable

Linearly separable data is any data that can be plotted in a graph and separated into classes using a straight line.
Non-Linearly Separable
Kernelized SVM handles non-linearly separable data. Say we have some non-linearly separable data in one dimension; we can transform this data into two dimensions, mapping each 1-D data point to a corresponding 2-D ordered pair, and the data becomes linearly separable in two dimensions.
For any non-linearly separable data in any dimension, we can map the data to a higher dimension and then make it linearly separable.
Kernel
● Linear kernel: the simplest kernel, K(x, y) = xᵀy; suited to data that is already linearly separable.
● Polynomial kernel: K(x, y) = (xᵀy + r)ᵈ; maps the data into a higher-dimensional polynomial feature space.
● Gaussian kernel: a general-purpose kernel, used when there is no prior knowledge about the data. Its equation is:
K(x, x') = exp(−‖x − x'‖² / (2σ²))
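A minimal sketch comparing the three kernels on a toy non-linearly separable dataset; make_moons and the noise level are assumptions for illustration.

# A minimal SVM kernel comparison on toy two-class data.
# make_moons generates non-linearly separable data for illustration.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):   # rbf = Gaussian kernel
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))         # training accuracy per kernel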
Unsupervised Learning

No predefined categories or labels are available for the target variables, and the goal often is to create categories or labels based on the patterns available in the data.
Clustering
Clustering is an unsupervised learning problem.
Key objective is to identify distinct groups (called clusters) based on some notion of
similarity within a given dataset.
The origins of clustering analysis can be traced to the fields of anthropology and psychology in the 1930s.
The most popularly used clustering techniques are
● k-means (divisive)
● hierarchical (agglomerative)
K-Means
The key objective of the k-means algorithm is to organize data into clusters such that there is high intra-cluster similarity and low inter-cluster similarity.

An item will belong to only one cluster, not several; that is, the algorithm generates a specific number of disjoint, non-hierarchical clusters.

K-means uses a divide-and-conquer strategy, and it is a classic example of an expectation-maximization (EM) algorithm. EM algorithms are made up of two steps:

● Expectation (E): used to find the expected point associated with a cluster.
● Maximization (M): used to improve the estimation of the cluster using the knowledge from the first step.
The two steps are processed repeatedly until convergence is reached.
Step 1: k centroids (say k = 3) are randomly picked (only in the first iteration), and all the points nearest to each centroid are assigned to that centroid's cluster. A centroid is the arithmetic mean, or average position, of all the points in its cluster.
Step 2: Each centroid is recalculated as the average of the coordinates of all the points in its cluster. Then step 1 is repeated (assigning each point to the nearest centroid) until the clusters converge.

K-means is designed for Euclidean distance only.
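A minimal k-means sketch follows; the blob-shaped toy data and k = 3 are assumptions for illustration.

# A minimal k-means sketch (k = 3) on toy 2-D data.
# make_blobs and the chosen k are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)   # final centroid positions (step 2 result)
print(km.labels_[:10])       # cluster assignment of the first 10 points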


Limitations of K-means

● K-means clustering needs the number of clusters to be specified.
● K-means has problems when clusters are of differing sizes, densities, and non-globular shapes.
● The presence of outliers can skew the results.
Reinforcement Learning
● The basic objective of reinforcement learning algorithms is to map situations to
actions that yield the maximum final reward.
● While mapping the action, the algorithm should consider not just the immediate reward but also the next and all subsequent rewards.
● For example, a program to play a game or drive a car will have to constantly interact
with a dynamic environment in which it is expected to perform a certain goal.
Examples of reinforcement learning techniques are the following:

● Markov decision process
● Q-learning (a minimal update-rule sketch follows this list)
● Temporal difference methods
● Monte Carlo methods
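Of these, Q-learning is the most compact to sketch. Below is a minimal tabular Q-learning example; the tiny chain-shaped environment, the reward scheme, and the hyperparameters (alpha, gamma) are all assumptions for illustration. The behavior policy explores randomly while the update learns the greedy values (Q-learning is off-policy), and the update considers both the immediate reward and discounted future rewards.

# A minimal tabular Q-learning sketch. The 5-state chain world, the
# reward scheme, and the hyperparameters are illustrative assumptions.
import random

n_states, n_actions = 5, 2      # actions: 0 = left, 1 = right
alpha, gamma = 0.1, 0.9         # learning rate and discount factor
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    # Move left/right along the chain; reward 1 on reaching the goal.
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(2000):           # training episodes
    s = 0
    while s != n_states - 1:
        a = random.randrange(n_actions)   # random exploration policy
        s2, r = step(s, a)
        # Update uses immediate reward r plus discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(q), 3) for q in Q])  # learned value of each state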
Reinforcement Learning

Consider the example of teaching a new trick to a dog, where you cannot tell the dog what to do. Instead, you reward the dog if it does the trick right and punish it if it does it wrong.
With every step, it has to remember what made it get the reward or punishment; this is commonly known as the credit assignment problem.
Introduction to NLP
• Natural language processing is a part of computer science and artificial intelligence that deals with human languages.
• Text analytics/mining is the process of deriving meaningful information from natural language text.
• It usually involves structuring the input text, deriving patterns, and evaluating and interpreting the output.
Need for NLP

• There is an increasing role of language in human-computer interaction.
• Vast amounts of unstructured textual data are available.
• To enable interaction between computers and humans, computers need to understand the natural languages used by humans.
• Natural language processing is all about making computers learn, process, and manipulate natural languages.
Need for NLP

• Processing text data is an essential task, as there is an abundance of text available everywhere.
• Text data can be found in various sources such as books, websites, social media, news articles, research papers, emails, and more.
• However, text data is often unstructured, meaning it lacks a predefined format or organization.
• To harness the valuable information contained within text data, it is necessary to process and analyze it effectively.
Applications of NLP
• Twitter or Facebook sentiment analysis, which is being used heavily now.
• Customer chat services provided by various companies; the process behind them is NLP.
• Speech recognition, including voice assistants like Google Assistant and Alexa.
• Translating data from one language to another.
• Advertisement matching: recommendation of ads based on your history.
Programming Languages for NLP
NLP Libraries
• Scikit-learn: provides a wide range of algorithms for building machine learning models in Python.
• Natural Language Toolkit (NLTK): a complete toolkit for all NLP techniques.
• Pattern: a web mining module for NLP and machine learning.
• TextBlob: provides an easy interface for basic NLP tasks like sentiment analysis, noun phrase extraction, or POS tagging.
• Quepy: used to transform natural language questions into queries in a database query language.
• SpaCy: an open-source NLP library used for data extraction, data analysis, sentiment analysis, and text summarization.
• Gensim: works with large datasets and processes data streams.
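As a minimal illustration of one of these libraries, here is an NLTK sketch for tokenization and part-of-speech tagging; note that the downloadable resource names can vary between NLTK versions.

# A minimal NLTK sketch: tokenize a sentence and tag parts of speech.
# Resource names ('punkt', 'averaged_perceptron_tagger') may differ
# across NLTK versions.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Natural language processing makes computers understand text."
tokens = nltk.word_tokenize(text)   # split the text into word tokens
print(tokens)
print(nltk.pos_tag(tokens))         # (token, part-of-speech tag) pairs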
List of NLP APIs
• IBM Watson API
IBM Watson API combines different sophisticated machine
learning techniques to enable developers to classify text into
various custom categories.
• Chatbot API
Chatbot API allows you to create intelligent chatbots for any
service.
• Speech to text API
Speech to text API is used to convert speech to text
• Text Analysis API by AYLIEN
Text Analysis API by AYLIEN is used to derive meaning and
insights from the textual content.
• Cloud NLP API
The Cloud NLP API is used to improve the capabilities of the
application using natural language processing technology.
Machine Learning Python Packages
There is a rich set of open source libraries available to facilitate practical machine learning.
These are mainly known as scientific Python libraries and are generally put to use when performing elementary machine learning tasks.
At a high level we can divide these libraries into data analysis and core machine learning libraries, based on their usage/purpose.
Machine Learning Python Packages

Data analysis packages:


These are the sets of packages that provide us the mathematical and scientific
functionalities that are essential to perform data preprocessing and
transformation.
Core Machine learning packages:
These are the set of packages that provide us with all the necessary machine
learning algorithms and functionalities that can be applied on a given dataset
to extract the patterns.
Data Analysis Packages

There are four key packages that are most widely used for data
analysis.
• NumPy
• SciPy
• Matplotlib
• Pandas
Data Analysis Packages

Pandas, NumPy, and Matplotlib play a major role and have the
scope of usage in almost all data analysis tasks.
SciPy supplements the NumPy library and provides a variety of key high-level science and engineering modules; the usage of these functions, however, largely depends on the use case.
Data Analysis Packages
NumPy
● NumPy is the core library for scientific computing in Python.
● It provides a high-performance multidimensional array object, and tools for
working with these arrays.
● It is the successor of the Numeric package.
Pandas
Python has always been great for data munging; however, it was not great for analysis compared to databases using SQL, Excel, or R data frames.
pandas is an open source Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.
pandas was developed by Wes McKinney in 2008 while at AQR Capital Management, out of the need for a high performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR he was able to convince management to allow him to open source the library.
pandas is well suited for tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet.
Data Structures
Pandas introduces two new data structures to Python – Series and DataFrame, both of
which are built on top of NumPy (this means it’s fast).
Series: a one-dimensional object, similar to a column in a spreadsheet or SQL table. By default each item is assigned an index label from 0 to N.
# creating a Series by passing a list of values and a custom index label
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 5, 6], index=['A', 'B', 'C', 'D', 'E', 'F'])
print(s)
# ---- output ----
# A    1.0
# B    2.0
# C    3.0
# D    NaN
# E    5.0
# F    6.0
# dtype: float64
DataFrame

A DataFrame is a two-dimensional object similar to a spreadsheet or an SQL table. It is the most commonly used pandas object.
Creating a pandas dataframe

Input:
data = {'Gender': ['F', 'M', 'M'], 'Emp_ID': ['E01', 'E02', 'E03'],
        'Age': [25, 27, 25]}
Code:
import pandas as pd

df = pd.DataFrame(data, columns=['Emp_ID', 'Gender', 'Age'])
df
# ---- output ----
#   Emp_ID Gender  Age
# 0    E01      F   25
# 1    E02      M   27
# 2    E03      M   25
Reading and Writing Data
# Reading
df=pd.read_csv('Data/mtcars.csv') # from csv
df=pd.read_csv('Data/mtcars.txt', sep='\t') # from text file
df=pd.read_excel('Data/mtcars.xlsx','Sheet2') # from Excel
# writing
# index = False parameter will not write the index values, default is True
df.to_csv('Data/mtcars_new.csv', index=False)
df.to_csv('Data/mtcars_new.txt', sep='\t', index=False)
df.to_excel('Data/mtcars_new.xlsx',sheet_name='Sheet1', index = False)
Basic Statistics Summary
describe() returns quick stats such as count, mean, std (standard deviation), min, first quartile, median, third quartile, and max for each numeric column of the dataframe.

df = pd.read_csv('Data/iris.csv')
df.describe()
Basic Statistics Summary
cov() - Covariance indicates how two variables are related.
A positive covariance means the variables are positively related, while a negative
covariance means the variables are inversely related.
df = pd.read_csv('Data/iris.csv')
df.cov(numeric_only=True)   # numeric_only (pandas >= 1.5) skips non-numeric columns
Basic Statistics Summary
corr() - Correlation is another way to determine how two variables are related.
In addition to telling you whether variables are positively or inversely related,
correlation also tells you the degree to which the variables tend to move together.
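A minimal usage sketch, mirroring the describe() and cov() examples above; numeric_only (available from pandas 1.5) is used to skip non-numeric columns.

df = pd.read_csv('Data/iris.csv')
df.corr(numeric_only=True)   # pairwise correlation of numeric columns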
Pandas view functions
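The commonly used view functions are sketched below; the mtcars.csv file is the same one assumed in the earlier reading examples.

# A minimal sketch of common pandas view functions.
df = pd.read_csv('Data/mtcars.csv')
df.head(5)     # first 5 rows
df.tail(5)     # last 5 rows
df.dtypes      # data type of each column
df.info()      # index, columns, non-null counts, memory usage
df.shape       # (number of rows, number of columns)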
Pandas basic operations
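A minimal sketch of common basic operations; the column names are assumptions about the iris.csv file used earlier.

# A minimal sketch of basic pandas operations; column names are
# assumptions about the iris.csv file used earlier.
df = pd.read_csv('Data/iris.csv')
df['Sepal.Length'].mean()            # aggregate over one column
df.sort_values(by='Sepal.Length')    # sort rows by a column
df[df['Sepal.Length'] > 5.0]         # boolean filtering of rows
df.groupby('Species').mean()         # group-wise aggregates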
Machine Learning Core Libraries
Python has a plethora of open source machine learning libraries.
The top 10 Python machine learning libraries can be ranked by their number of contributors, along with the percentage growth in their contributor counts between 2015 and 2016.
Reference Book
Manohar Swamynathan, Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics Using Python, Apress.
