Everyone is an expert in machine learning… Everyone slaps “I know this” on their resume or CV… In interviews, when you ask a candidate a few questions about ML, you quickly realize they do not know anything beneath the surface, which is a painful experience for both parties.
Why this happens is beyond me. I guess people think they can make money pretending to be data scientists…
OK, let’s dig into some interesting stuff…
Supervised Learning
Supervised learning is when we want to learn from our data by specifying some target variable or value.
Classification – which class an instance of data falls into. Simple as that.
The target variable could be:
- a nominal value, like true or false, zero or one, animal or plant (classification)
- one of an infinite number of numeric values (regression!)
Regression – prediction of a numeric value. Do you remember those school days and the “best-fit” line?
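To make the “best-fit” line concrete, here is a minimal sketch of ordinary least squares on a handful of points. The data and the function names `fit_line` and `predict` are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Regression output: a numeric value read off the fitted line."""
    return slope * x + intercept


# Made-up points that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

Calling `predict(4.0, slope, intercept)` then gives the numeric prediction for an unseen input, which is exactly what separates regression from classification.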
A problem facing machine learning algorithms is that some problems out there do not have deterministic solutions. Human motivation would be an example: it is hard to model.
I, for example, used a vectorized fuzzywuzzy algorithm to match sentences.
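A rough idea of what fuzzy sentence matching looks like, as a sketch only: this is not fuzzywuzzy and not the setup mentioned above, just the standard library's `difflib.SequenceMatcher` as a stand-in. The sentences and the helpers `similarity` and `best_match` are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two sentences, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates):
    """Return the candidate sentence most similar to the query."""
    return max(candidates, key=lambda c: similarity(query, c))


# Made-up candidate sentences
candidates = [
    "open a savings account",
    "close my checking account",
    "order a new debit card",
]
match = best_match("I want to open savings account", candidates)
```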
OK, so what are expert systems? Expert systems are an interesting part of machine learning. Basically, an “expert system” is a system that can substitute for a human expert in some field.
Think of a mathematician or statistician working through numbers by hand. An expert system can do the same work, often with more precision.
If you measure some subject, you are talking about rows and columns. The columns are called “features” or “attributes”.
So we will have a table of “instances” (the rows) with “features” (the columns).
The table below shows patterns of how people of different races and ethnicities handle their bank accounts (annual deposits and withdrawals).
|   | deposit | withdraw | account type | race | ethnicity |
| --- | --- | --- | --- | --- | --- |
| 1 | 100000 | 70000 | checking | white | Serbia |
| 2 | 50000 | 45000 | savings | white | America |
| 3 | 20000 | 25000 | checking | black | America |
| 4 | 50000 | 10000 | savings | white | Japan |
| 5 | 200000 | 100000 | checking | brown | Argentina |
The first two features are numeric, so they can take decimal values.
The third feature is binary: in this case it can only be 1 or 0.
The fourth column can be enumerated with integers, so each race label is represented by a number: 1, 2, 3, … n.
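The encoding just described can be sketched in a few lines. Which account type becomes 1 and which integer each label receives are arbitrary assumptions made for this example, and the helper names are hypothetical.

```python
# Rows from the table: (deposit, withdraw, account type, race)
rows = [
    (100000, 70000, "checking", "white"),
    (50000, 45000, "savings", "white"),
    (20000, 25000, "checking", "black"),
    (50000, 10000, "savings", "white"),
    (200000, 100000, "checking", "brown"),
]

# Assumption: checking -> 0, savings -> 1 (the opposite choice works too)
ACCOUNT_CODES = {"checking": 0, "savings": 1}


def build_codes(labels):
    """Assign 1, 2, 3, ... to labels in order of first appearance."""
    codes = {}
    for label in labels:
        if label not in codes:
            codes[label] = len(codes) + 1
    return codes


RACE_CODES = build_codes(race for _, _, _, race in rows)


def encode(row):
    """Turn one table row into a list of numbers."""
    deposit, withdraw, account, race = row
    return [float(deposit), float(withdraw), ACCOUNT_CODES[account], RACE_CODES[race]]


encoded = [encode(r) for r in rows]
```

One caveat: integer codes impose an order that nominal labels do not really have, so some algorithms may read more into the numbers than is there.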
So we want to do classification on this data set. First we need to come up with a classification algorithm and train it. To do that we need a training set, that is, data.
We have 5 training examples.
We have 4 features and one target variable.
In a classification problem the target variables are called classes, and the number of classes is assumed to be finite.
We will assume that the table above is our test set as well.
Machine learning algorithms aim for a desired level of accuracy. Can we also describe what the algorithm has learned, its knowledge representation? It depends. Some algorithms have a knowledge representation, some don’t.
Examples of knowledge representation might be:
- a set of rules
- a probability distribution
- an example from the training set
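The three kinds of knowledge representation can be shown side by side on toy data. Everything here is made up for illustration: the points, the labels "A" and "B", the threshold in the rule, and the function names.

```python
from collections import Counter

# Made-up training examples: (feature_1, feature_2) -> class label
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.5), "A"),
    ((6.0, 5.0), "B"),
    ((7.0, 6.5), "B"),
]


# 1. Set of rules: a hand-written, purely illustrative threshold rule
def rule_label(point):
    return "A" if point[0] < 4.0 else "B"


# 2. Probability distribution: relative frequency of each class
counts = Counter(label for _, label in training)
class_probs = {label: n / len(training) for label, n in counts.items()}


# 3. Example from the training set: the stored rows are the "knowledge";
#    predict by copying the label of the closest row (1-nearest neighbor)
def nearest_label(point):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda example: sq_dist(example[0], point))
    return label
```

A rule can be read by a human, a distribution tells you how likely each class is, and the nearest-neighbor version has no summary at all beyond the stored examples.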
Machine Learning tasks:
So we are working on a classification task.
Classification is the prediction of the class an instance of data belongs to.
Regression is another task in machine learning. Regression is the prediction of a numerical value.
Classification and regression are examples of supervised learning…
The opposite of supervised learning is unsupervised learning. Under unsupervised learning there is no label or target value in the data. Clustering, finding statistical values that describe the data, and reducing the data from many features to a small number of features for visualization are all unsupervised learning tasks.
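To show what "no label" means in practice, here is a small k-means clustering sketch. The points and starting centroids are made up, and fixed starting centroids are used only so that the example is repeatable; real implementations usually pick them at random.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return centroids, clusters


# Made-up, unlabeled points forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Nobody told the algorithm which group each point belongs to; the groups come out of the distances alone.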
Supervised learning tasks
- Classification: k-Nearest Neighbors, Naive Bayes, Support vector machines, Decision trees
- Regression: Linear, Locally weighted linear, Ridge, Lasso
Unsupervised learning tasks
- Clustering: k-Means, DBSCAN
- Density estimation: Expectation maximization, Parzen window
What is important to understand in ML?
How do you choose an algorithm?
Consider the goal.
What are you trying to get out of it?
Maybe it is the probability of lowering risk for the bank, or the similar interests of users who order some product from the retail bank.
The answer also depends on what data you have or could collect.
Obviously, if you are trying to predict target values, you need to look at supervised learning.
If the value you are looking for is discrete, such as 1/0, yes/no, a/b/c or black/yellow/white, then you will use classification. If you are looking for a numeric value in a range, e.g. 0.00 to 100.00, -100 to 100, or -∞ to +∞, then you are going to use regression.
Without target values, you are looking at unsupervised learning…
Fitting data into discrete groups calls for a clustering algorithm. If you also want a numerical estimate of how strong the fit is in each group, then you should use a density estimation algorithm.
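The questions above boil down to a small decision tree. As a sketch, with a hypothetical function name and parameter names:

```python
def suggest_task(has_target, target_is_discrete=False, want_fit_strength=False):
    """Map the questions above to a family of machine learning tasks."""
    if has_target:
        # Supervised: discrete target -> classification, numeric -> regression
        return "classification" if target_is_discrete else "regression"
    # Unsupervised: group membership only -> clustering,
    # numeric strength of fit per group -> density estimation
    return "density estimation" if want_fit_strength else "clustering"


# A yes/no target points to classification
task = suggest_task(has_target=True, target_is_discrete=True)
```

This only narrows things down to a family of algorithms; picking one inside the family still means trying a few on your data.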
It is absolutely essential to know your data:
- Are the features nominal or continuous?
- Are there missing values in the features?
- If there are missing values, why is that?
- Are there outliers in the data?
- Are you looking for something that is very infrequent?
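A few of these checks are easy to automate. The sketch below assumes missing values are stored as `None` and uses the common 1.5 × IQR rule for outliers, which is only one convention among several. The data, the "fraud" label, and the function names are made up.

```python
from collections import Counter
from statistics import quantiles


def missing_count(values):
    """Count entries recorded as None (assumed marker for 'missing')."""
    return sum(v is None for v in values)


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing ones."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in present if v < low or v > high]


def class_shares(labels):
    """Share of each class; a tiny share means you are hunting a rare event."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


# Made-up feature column with one missing value and one extreme value
deposits = [10, 12, None, 11, 13, 12, 11, 500]
# Made-up labels where the interesting class is rare
labels = ["ok"] * 19 + ["fraud"]

n_missing = missing_count(deposits)
extreme = iqr_outliers(deposits)
shares = class_shares(labels)
```

None of this answers why a value is missing; that part still takes knowing where the data came from.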
How to Develop an ML Application