Machine Learning according to Sasha

Everyone is an expert in machine learning… Everyone slaps “I know this” on the resume or CV… In interviews, when you ask a candidate some questions about ML, you quickly realize they do not actually know anything beneath the surface… which is a painful experience for both parties.

Why this is so is beyond me. I guess people think they can make money pretending to be data scientists…

Ok let’s dig into some interesting stuff…

Supervised Learning

Supervised learning is when we want to learn from our data by specifying some target variable or value…

Classification – which class an instance of data falls into – simple as that

The target variables could be:

  1. a nominal value, like true or false, zero or one, animal or plant … (classification)
  2. one of an infinite number of numeric values … (regression!!)

Regression – prediction of a numeric value – do you remember those school days and the “best-fit” line…
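To make the “best-fit” line concrete, here is a minimal sketch of an ordinary least squares fit in plain Python. The function name `fit_line` and the data points are my own assumptions for illustration, not something taken from a real data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data points lying close to y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted y for an unseen x = 5
print(f"slope={slope:.2f} intercept={intercept:.2f} prediction={prediction:.2f}")
```

The fitted line is the “model”, and predicting a numeric value for a new x is the regression task.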

A problem facing machine learning algorithms is that some problems out there do not have deterministic solutions… an example would be motivation in humans… That is hard to model…

I, for example, used a vectorized fuzzywuzzy algorithm to match sentences.
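To give a feel for what fuzzy sentence matching looks like, here is a rough sketch. It is not fuzzywuzzy and not the code I actually used: it swaps in `difflib.SequenceMatcher` from the Python standard library as a stand-in, and the helper names and sentences are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two sentences, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates):
    """Return the candidate sentence most similar to the query."""
    return max(candidates, key=lambda c: similarity(query, c))


# Hypothetical sentences, invented for illustration.
candidates = [
    "The customer opened a savings account",
    "The bank approved the loan",
    "It rained all day in Belgrade",
]
match = best_match("customer opens a saving account", candidates)
print(match)
```

The point is that the match is scored on a scale rather than being an exact yes or no.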

OK, so what are expert systems? Expert systems are an interesting part of machine learning. Basically, an “expert system” is a system that can substitute for someone who is an expert in something.

Think about a mathematician or statistician doing something manually with numbers. Well, an expert system can do the same, or do it better and more precisely.

If you measure some subject, you are talking about rows and columns. Those columns can be called “features” or “attributes”.

We will have a table of “instances” with “features”.

In the table below we have patterns of how people of different races and ethnicities handle their bank accounts (annual deposits and withdrawals).

#  deposit  withdraw  account type  race   ethnicity
1  100000   70000     checking      white  Serbia
2  50000    45000     savings       white  America
3  20000    25000     checking      black  America
4  50000    10000     savings       white  Japan
5  200000   100000    checking      brown  Argentina

The first two features are numeric, so they can take decimal values.

The third feature is binary: in this case it can only be 1 or 0.

The fourth column can be enumerated by integers, so each race is represented by a number: 1, 2, 3, … n
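Here is a small sketch of how the first four columns of the table above could be turned into numeric feature vectors. Which account type gets 1 and which gets 0, and the order in which the nominal values are numbered, are my own arbitrary assumptions; `enumerate_values` is a hypothetical helper.

```python
# Rows copied from the table above: deposit, withdraw, account type, race.
rows = [
    (100000, 70000, "checking", "white"),
    (50000, 45000, "savings", "white"),
    (20000, 25000, "checking", "black"),
    (50000, 10000, "savings", "white"),
    (200000, 100000, "checking", "brown"),
]

# Assumption: which account type maps to 1 is an arbitrary choice.
ACCOUNT_CODES = {"savings": 0, "checking": 1}


def enumerate_values(values):
    """Assign 1, 2, 3, ... to nominal values in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
    return codes


race_codes = enumerate_values(row[3] for row in rows)

# Numeric columns become floats, the binary and nominal columns become codes.
encoded = [
    [float(deposit), float(withdraw), ACCOUNT_CODES[account], race_codes[race]]
    for deposit, withdraw, account, race in rows
]

for vector in encoded:
    print(vector)
```

The integers used for a nominal feature are only labels; their order carries no meaning.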

So we want to do classification on this data set. First we need to come up with a classification algorithm and train it. To do that we need a training set: data for which the target value is already known.

We have 5 training examples

We have 4 features and one target variable

In a classification problem the values of the target variable are called classes, and there is assumed to be a finite number of classes.

We will assume that our test set is the table above.

Machine learning algorithms have a desired level of accuracy. Can we describe that level of accuracy, or the knowledge representation behind it? It depends. Some algorithms have a readable knowledge representation, some don’t.
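As a rough sketch of training and testing a classifier, here is a tiny nearest-neighbour classifier in plain Python. The text does not say which column is the target, so purely for illustration I assume the account type is the class and use only deposit and withdraw as features; `classify` and `accuracy` are hypothetical helper names.

```python
import math

# (deposit, withdraw) -> account type, taken from the table above.
# Assumption: account type is used as the target purely for illustration.
training_set = [
    ((100000, 70000), "checking"),
    ((50000, 45000), "savings"),
    ((20000, 25000), "checking"),
    ((50000, 10000), "savings"),
    ((200000, 100000), "checking"),
]


def classify(features, examples, k=1):
    """Predict a class by majority vote among the k nearest training examples."""
    nearest = sorted(examples, key=lambda ex: math.dist(features, ex[0]))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def accuracy(examples, test_set, k=1):
    """Fraction of test instances whose predicted class matches the true class."""
    correct = sum(classify(x, examples, k) == y for x, y in test_set)
    return correct / len(test_set)


# Hypothetical new customer, not in the table.
print(classify((60000, 40000), training_set))
# Testing on the training data itself, as assumed in the text.
print(accuracy(training_set, training_set))
```

With k=1 every training row is its own nearest neighbour, so the score on the training data is perfect and says little about how the classifier handles new data.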

Examples of knowledge representation might be:

  • a set of rules
  • a probability distribution
  • an example from the training set

Machine Learning tasks:

So we are working on the classification task.

Classification is the prediction of the class an instance of data belongs to.

Regression is another task in machine learning. Regression is the prediction of a numerical value.

Classification and regression are examples of supervised learning …

The opposite of supervised learning is unsupervised learning. Under unsupervised learning there is no label or target value in the data. For example, clustering, finding statistical values that describe the data, and reducing the data from many features to a small number of features for visualization purposes are all unsupervised learning tasks.
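To show what “no target value” means in practice, here is a minimal k-means clustering sketch in plain Python. It uses only the deposit and withdraw columns from the table above and ignores every label. Starting from the first k points is my own simplifying assumption to keep the run repeatable; real implementations usually start from random points.

```python
import math


def k_means(points, k, iterations=10):
    """Plain k-means: returns (centroids, labels). Starts from the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # nothing moved, so the clustering is stable
        centroids = new_centroids
    return centroids, labels


# (deposit, withdraw) pairs from the table above; no target column is used.
points = [
    (100000, 70000),
    (50000, 45000),
    (20000, 25000),
    (50000, 10000),
    (200000, 100000),
]
centroids, labels = k_means(points, k=2)
print(labels)
```

The algorithm only groups rows that sit close together; deciding what each group means is left to you.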

 

Supervised learning tasks

  Classification                 Regression
  • k-Nearest Neighbors          • Linear regression
  • Naive Bayes                  • Locally weighted linear regression
  • Support vector machines      • Ridge regression
  • Decision trees               • Lasso

Unsupervised learning tasks

  Clustering                     Density estimation
  • k-Means                      • Expectation maximization
  • DBSCAN                       • Parzen window

 

 

What is important to understand in ML?

How do you choose an algorithm?

Consider the goal.

What are you trying to get out?

Maybe it is the probability of lowering risk for the bank, or finding similar interests among users who order some product from the retail bank.

The other part of the answer is what data you have or could collect.

Obviously, if you are trying to predict a target value, you need to look at supervised learning.

If the value you are looking for is discrete, like 1/0, yes/no, a/b/c or black/yellow/white, then you will use classification. If you are looking for a numeric value, then you are going to use regression, e.g. 0.00 – 100.00, -100 to 100, or -∞ to +∞.

 

If you are not trying to predict a target value, look at unsupervised learning…

If you are trying to fit your data into discrete groups, you need a clustering algorithm. If you also want a numerical estimate of how strong the fit is for each group, then you should use a density estimation algorithm.
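The questions above boil down to a small decision tree. Here is a sketch of it as a function; the name `choose_task` and its parameters are hypothetical and only restate the rules from the text.

```python
def choose_task(has_target, target_is_discrete=False, need_fit_strength=False):
    """Map the questions from the text to a family of machine learning tasks."""
    if has_target:
        # Supervised learning: the kind of target decides the task.
        return "classification" if target_is_discrete else "regression"
    # Unsupervised learning: group membership only, or a numeric strength of fit.
    return "density estimation" if need_fit_strength else "clustering"


print(choose_task(has_target=True, target_is_discrete=True))
print(choose_task(has_target=True, target_is_discrete=False))
print(choose_task(has_target=False))
print(choose_task(has_target=False, need_fit_strength=True))
```

It only picks the family of task; which algorithm to use inside that family still depends on your data.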

It is absolutely essential to know your data:

Are the features nominal or continuous?

Are there missing values in the features?

If there are missing values, why is that?

Are there outliers in the data?

Are you looking for something that is very infrequent?
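A few of these questions can be checked with a handful of lines. Here is a sketch in plain Python that counts missing values, flags outliers and measures how rare each class is. The data, the “fraud” label and the choice of the interquartile-range rule for outliers are all my own assumptions for illustration.

```python
import statistics
from collections import Counter

# Hypothetical column of annual deposits; None marks a missing value.
deposits = [100000, 50000, None, 50000, 200000, 20000, 45000, 60000]
# Hypothetical class labels, where "fraud" is the rare event of interest.
labels = ["ok", "ok", "ok", "ok", "fraud", "ok", "ok", "ok"]


def count_missing(values):
    """Number of missing (None) entries in a column."""
    return sum(v is None for v in values)


def find_outliers(values, factor=1.5):
    """Values outside the interquartile-range fences (Tukey's rule)."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in present if v < low or v > high]


def class_frequencies(values):
    """Share of each class, to spot very infrequent ones."""
    counts = Counter(values)
    return {label: count / len(values) for label, count in counts.items()}


print(count_missing(deposits))
print(find_outliers(deposits))
print(class_frequencies(labels))
```

None of this replaces actually looking at the data, but it tells you quickly where to look first.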

How to Develop ML Application

 

 

 

 

 

 

 

 

 

 

 
