Friday 11 November 2016

Machine learning basics



ML is a subset of AI that uses various mathematical (statistical) techniques to build predictive models with the help of an adequate training data set. In ML, we are essentially framing a mathematical function or hypothesis, y = f(x), where y is the predicted output and x (x1, x2, x3, ...) is a series of independent factors (inputs) that determine the output y.

x (x1, x2, x3, ...) is called the feature vector.
y is called the label.

A training example contains two elements, the feature vector and the label (i.e. the input and its respective output). A training data set is a set of multiple training examples.
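For example, a minimal sketch in Python (the house-price features and values here are purely hypothetical, just to illustrate the structure):

    # A training example pairs a feature vector x with its label y.
    training_set = [
        # ([area sq.ft, bedrooms, age yrs],  price)
        ([1400.0, 3.0, 10.0], 245000.0),
        ([1600.0, 3.0,  5.0], 312000.0),
        ([1700.0, 4.0,  2.0], 279000.0),
    ]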

As we know, a mathematical function has a coefficient associated with each of its x values, as below. These coefficients are unknown in the beginning, and in machine learning we try to determine their values using certain techniques and algorithms (a cost function and gradient descent), which I will cover below.

y = h(x) = a0 + a1*x1 + a2*x2 + a3*x3 + ...   (a0, a1, a2, a3, ... are the unknown coefficients to be learned)
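As a sketch, such a hypothesis is just an intercept plus a weighted sum of the features (Python, illustrative; the function name is mine):

    def hypothesis(coefficients, x):
        # h(x) = a0 + a1*x1 + a2*x2 + ... ; coefficients[0] is the intercept a0.
        return coefficients[0] + sum(a * xi for a, xi in zip(coefficients[1:], x))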

A typical machine learning workflow has the steps below.

1. Feature detection: This is the technique of detecting the different features. Since ML models accept only numerical values, we need to convert all features into numeric form; in the case of text data, we would need to convert it into numerical vectors. We also need to normalize and scale the data (a scaling sketch follows this list).

2. Building the training data set: This is the step of building the training set. Note that we must have a good amount of training data in order to build a good prediction model. Typically, 60% of the data set is used for the actual training, 20% for evaluating the model, and 20% for testing the predictions (a split sketch also follows this list).


3. Configure the model based on the need: There are different models, such as linear regression, logistic regression, and classification models. We can also configure various cost functions and optimization algorithms to determine the coefficients efficiently.
  
4. Feed the training set into the model for training: This is the process of identifying the coefficients (the model's parameters; settings such as the learning rate are called hyper-parameters). We may need to run multiple iterations of training so that the model becomes more accurate.

5. Evaluation: Here the model is evaluated against the 20% evaluation data set, and a graph or model statistics are generated.

6. Prediction: Here the model is tested with the 20% test data. At this stage, it can accept new inputs, and the output can be predicted and verified.
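As a minimal sketch of steps 1 and 2 in plain Python (the min-max scaling method is my choice; the 60/20/20 proportions follow the text, and the function names are mine):

    import random

    def min_max_scale(vectors):
        # Rescale each feature column into [0, 1] so no feature dominates.
        cols = list(zip(*vectors))
        lows, highs = [min(c) for c in cols], [max(c) for c in cols]
        return [[(v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, lo, hi in zip(vec, lows, highs)]
                for vec in vectors]

    def split_60_20_20(examples, seed=42):
        # Shuffle, then split into training / evaluation / test sets.
        examples = examples[:]
        random.Random(seed).shuffle(examples)
        a, b = int(0.6 * len(examples)), int(0.8 * len(examples))
        return examples[:a], examples[a:b], examples[b:]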

Gradient descent optimization and the cost function: the math behind it

The cost function is nothing but the error, or "mistakes", of the hypothesis function: the amount of deviation between the original output values and the hypothesis output values. We use the formula below (the mean squared error) for the cost function. Our goal is a hypothesis function with a low, ideally zero, error value.

cost function J(c) = (1/(2m)) * sum over i = 1..m of (h(x(i)) - y(i))^2, where m is the number of training examples
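A direct translation into Python (a sketch, reusing the hypothesis function defined earlier):

    def cost(coefficients, examples):
        # J(c) = (1/(2m)) * sum((h(x) - y)^2) over all m training examples.
        m = len(examples)
        return sum((hypothesis(coefficients, x) - y) ** 2
                   for x, y in examples) / (2 * m)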

How do we find the hypothesis function with the minimum cost? We use the gradient descent algorithm to minimize the cost function and determine the coefficients of our hypothesis function.

Choose the coefficients (a0, a1, a2, ...) in such a way that the cost function is small. To make this clearer, consider a graph with the cost J(c) on the y axis and each coefficient on its own axis. You can see a surface with some topmost points and some bottommost points. The algorithm says: start at a random point on this surface and gradually move downwards until you reach a bottommost point, just like a ball rolling down a mountain to its lowest point.


Below is the algorithm.

Repeat until the cost function J(c) stops decreasing:
a(j) <-- a(j) - alpha * (partial derivative of J(c) with respect to a(j))

alpha is the learning rate, i.e. the size of the step the algorithm takes towards the next point. The larger the learning rate, the fewer iterations are needed; but if alpha is too large, the algorithm can overshoot the minimum and fail to converge.

a0, a1, a2, ... are the coefficients.

At each iteration we find new values for our coefficients and gradually move down the hill. At some point the cost function cannot be reduced further, and that point is our target.
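Putting it together, a minimal batch gradient descent sketch in Python (reusing the hypothesis function above; for the mean squared error, the partial derivative of J(c) with respect to a(j) works out to (1/m) * sum((h(x) - y) * x(j)), with x(0) taken as 1 for the intercept):

    def gradient_descent(examples, alpha=0.01, iterations=1000):
        # Learn coefficients [a0, a1, ...] by repeatedly stepping downhill.
        m, n = len(examples), len(examples[0][0])
        coeffs = [0.0] * (n + 1)                 # a0 plus one weight per feature
        for _ in range(iterations):
            grads = [0.0] * (n + 1)
            for x, y in examples:
                error = hypothesis(coeffs, x) - y
                grads[0] += error                # intercept term (x0 = 1)
                for j, xj in enumerate(x, start=1):
                    grads[j] += error * xj
            # Update all coefficients simultaneously.
            coeffs = [a - alpha * g / m for a, g in zip(coeffs, grads)]
        return coeffs

In practice the features should be scaled first (as in step 1) so that a single learning rate suits all the coefficients.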

That covers the basic math behind simple ML. There are many other variants and techniques used in ML, but this is the core math that every data engineer should be aware of.

Over-fitting and under-fitting problems

Some of the problems encountered while fitting a model to our data are over-fitting and under-fitting. Under-fitting happens when the model fits the data (training as well as new data) poorly. The solution is to try an alternative machine learning algorithm to generate a better-fitting model for our data.

Over-fitting happens when the model fits the training data well but fails to generalize to new data. It usually happens when there are outliers in the data or when there are extra, irrelevant features. One way to resolve this issue is regularization: a technique that reduces the impact of the coefficients that are less relevant to the output.
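As an illustrative sketch, L2 regularization (one common variant, used in ridge regression) adds a penalty on the size of the coefficients to the cost, which shrinks the less useful ones; here lam is a regularization-strength hyper-parameter of my choosing, and by convention the intercept a0 is not penalized:

    def regularized_cost(coefficients, examples, lam=0.1):
        # J(c) plus an L2 penalty (lam/(2m)) * sum(a(j)^2) for j >= 1.
        m = len(examples)
        penalty = lam * sum(a ** 2 for a in coefficients[1:]) / (2 * m)
        return cost(coefficients, examples) + penalty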

As a data engineer, I feel you don't need to spend much time understanding the mathematical concepts behind every technique. You just need a high-level understanding and should be able to configure your model, train it, and use it with the help of some available frameworks. There are many open-source frameworks in the market, and Apache Spark MLlib is one of them. In earlier days we were not able to store or process huge amounts of data and run ML on it.

Now we have Spark, which provides distributed processing capability and can chug through huge amounts of data on top of Hadoop, and MLlib provides a collection of ready-made machine learning models such as regression, classification, clustering, and collaborative filtering. I will cover these techniques in my next blog. Deep learning is another subset of AI, which uses neural networks (multi-layer perceptrons) to build predictive models. It is the most advanced AI technique and provides more accurate predictions; it's the technique behind face recognition, self-driving cars, etc.