Mathematics and Statistics behind Machine Learning — PART 3

Shubhang Agrawal · Published in Analytics Vidhya · Jan 6, 2021 · 7 min read

This is the 3rd and final part of the blog series, in which I will be covering the last topics in the mathematics and statistics behind Machine Learning.

If you didn’t read my previous blogs, please do check them out. Here are the links:

Part 1 —

Part 2 —

So without any further delay, let’s get started with the final part of the series.

For the final part we are left with two topics: Multivariate Calculus, and Algorithms and Complexity. In this blog we will cover both.

Multivariate Calculus

Before we move on to the definition, applications, and the rest of the material on calculus, I just want to mention that I am writing this blog assuming that you have a general knowledge of calculus, such as differentiation and integration, because it is required to understand the topic.

Multivariate means multiple variables, so multivariate calculus is the field of calculus that deals with functions of several variables; its core tool is partial differentiation. It is used for the mathematical optimization of a given function (mostly a convex one, since a convex function has a single minimum). Using multivariate calculus we can easily drive a convex objective down to its minimum, or lowest point. On non-convex functions the optimizer can get stuck in a local minimum, but there are methods to avoid that as well.
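For instance, take the convex function f(x, y) = x² + y². Its partial derivatives are

∂f/∂x = 2x,  ∂f/∂y = 2y

and both vanish only at (0, 0), the global minimum, so following the negative gradient from any starting point leads there.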

Gradient Descent

Gradient Descent is a machine learning algorithm that operates iteratively to find the optimal values of its parameters. It takes into account a user-defined learning rate and initial parameter values.

Formula:
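In its standard form, with θ a parameter, α the learning rate, and J(θ) the cost function:

θ := θ − α * dJ(θ)/dθ

Each step moves θ a small distance in the direction in which J decreases fastest.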

Why do we need it?

Generally, we would derive a closed-form formula that gives us the optimal values of our parameters. But this algorithm needs no such formula: it finds the values by itself!

Some Basic Rules For Derivation:

( A ) Scalar Multiple Rule:
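In standard notation, with c a constant: d/dx [ c * f(x) ] = c * f′(x)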

( B ) Sum Rule:
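d/dx [ f(x) + g(x) ] = f′(x) + g′(x)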

( C ) Power Rule:
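d/dx [ xⁿ ] = n * xⁿ⁻¹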

( D ) Chain Rule:
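d/dx [ f(g(x)) ] = f′(g(x)) * g′(x)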

Let’s take an example to understand more about gradient descent and multivariate calculus.

Gradient Descent in Linear Regression:

In linear regression, the model’s target is to find the best-fit regression line to predict the value of y for a given input value x. While training, the model calculates a cost function that measures the mean squared error between the predicted value (pred) and the true value (y). The model aims to minimize this cost function.
To minimize the cost function, the model needs the best values of θ1 and θ2. Initially the model selects θ1 and θ2 at random and then iteratively updates these values until the cost function reaches its minimum. Once it gets there, the model has the best θ1 and θ2 values, and using these final values in the hypothesis equation of the linear model, it predicts y as well as it can.

Linear Regression Cost Function:
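A common form, writing hθ(xi) for the prediction on the ith example and n for the number of training examples:

J(θ1, θ2) = (1/2n) * Σi ( hθ(xi) − yi )²

The factor 1/2 is a convention that cancels neatly when the square is differentiated.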

Gradient Descent Algorithm For Linear Regression:
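In its standard form, the following update is repeated for every parameter θj simultaneously, until convergence:

θj := θj − α * (1/n) * Σi ( hθ(xi) − yi ) * xij

where: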

-> θj     : Weights of the hypothesis.
-> hθ(xi) : Predicted y value for the ith input.
-> j      : Feature index number (can be 0, 1, 2, ..., n).
-> α      : Learning rate of gradient descent.
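To make the procedure concrete, here is a minimal NumPy sketch of batch gradient descent for simple linear regression. The variable names theta1 (intercept) and theta2 (slope), the toy data, and the hyperparameters are all illustrative choices of mine, not from the formulas above:

import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=2000):
    # hypothesis: h(x) = theta1 + theta2 * x
    theta1, theta2 = 0.0, 0.0            # arbitrary initial values
    n = len(x)
    for _ in range(iterations):
        pred = theta1 + theta2 * x       # predicted y for every input
        error = pred - y
        # partial derivatives of J = (1/2n) * sum(error^2)
        grad1 = error.sum() / n
        grad2 = (error * x).sum() / n
        theta1 -= alpha * grad1          # simultaneous update step
        theta2 -= alpha * grad2
    return theta1, theta2

# illustrative data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=50)
print(gradient_descent(x, y))  # approaches (1.0, 2.0)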

Algorithms and Complexity

Overview of Machine Learning Algorithms

When crunching data to model business decisions, you are most typically using supervised and unsupervised learning methods.

A hot topic at the moment is semi-supervised learning methods in areas such as image classification where there are large datasets with very few labeled examples.

Here I am going to mention the kinds of machine learning algorithms, grouped by similarity.

Regression Algorithms

The most popular regression algorithms are:

  • Ordinary Least Squares Regression (OLSR)
  • Linear Regression
  • Logistic Regression
  • Stepwise Regression
  • Multivariate Adaptive Regression Splines (MARS)
  • Locally Estimated Scatterplot Smoothing (LOESS)

Instance-based Algorithms

The most popular instance-based algorithms are:

  • k-Nearest Neighbor (kNN)
  • Learning Vector Quantization (LVQ)
  • Self-Organizing Map (SOM)
  • Locally Weighted Learning (LWL)
  • Support Vector Machines (SVM)

Regularization Algorithms

The most popular regularization algorithms are:

  • Ridge Regression
  • Least Absolute Shrinkage and Selection Operator (LASSO)
  • Elastic Net
  • Least-Angle Regression (LARS)

Decision Tree Algorithms

The most popular decision tree algorithms are:

  • Classification and Regression Tree (CART)
  • Iterative Dichotomiser 3 (ID3)
  • C4.5 and C5.0 (different versions of a powerful approach)
  • Chi-squared Automatic Interaction Detection (CHAID)
  • Decision Stump
  • M5
  • Conditional Decision Trees

Clustering Algorithms

The most popular clustering algorithms are:

  • k-Means
  • k-Medians
  • Expectation Maximisation (EM)
  • Hierarchical Clustering

Artificial Neural Network Algorithms

The most popular artificial neural network algorithms are:

  • Perceptron
  • Multilayer Perceptrons (MLP)
  • Back-Propagation
  • Stochastic Gradient Descent
  • Hopfield Network
  • Radial Basis Function Network (RBFN)

Deep Learning Algorithms

The most popular deep learning algorithms are:

  • Convolutional Neural Network (CNN)
  • Recurrent Neural Networks (RNNs)
  • Long Short-Term Memory Networks (LSTMs)
  • Stacked Auto-Encoders
  • Deep Boltzmann Machine (DBM)
  • Deep Belief Networks (DBN)

Dimensionality Reduction Algorithms

The most popular dimensionality reduction algorithms are:

  • Principal Component Analysis (PCA)
  • Principal Component Regression (PCR)
  • Partial Least Squares Regression (PLSR)
  • Sammon Mapping
  • Multidimensional Scaling (MDS)
  • Projection Pursuit
  • Linear Discriminant Analysis (LDA)
  • Mixture Discriminant Analysis (MDA)
  • Quadratic Discriminant Analysis (QDA)
  • Flexible Discriminant Analysis (FDA)

To know more about such kinds of algorithms, visit the link below -

Computational Complexity of ML Models

Time complexity can be seen as a measure of how fast or slow an algorithm will perform for a given input size. Time complexity is always given with respect to some input size (say n).
Space complexity can be seen as the amount of extra memory you require to execute your algorithm. Like time complexity, it is also given with respect to some input size (n).

The complexity of an algorithm/model is often expressed using Big O notation, which defines an upper bound on an algorithm’s growth: it bounds a function only from above.
The graph below visualizes different complexity classes.

http://bigocheatsheet.com/
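As a quick illustration (a toy sketch of my own, not tied to any of the models below): a single pass over the input is O(n), while a nested pass is O(n²).

def total(values):
    # O(n): touches each of the n items once
    s = 0
    for v in values:
        s += v
    return s

def pairwise_sums(values):
    # O(n^2): touches all n*n ordered pairs of items
    return [a + b for a in values for b in values]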

To write the computational complexities, we assume:
n = number of training examples, d = number of dimensions of the data,
k = number of neighbors

The complexity of K Nearest Neighbors to find the k closest neighbors

Prediction Time Complexity = O(k*n*d)
k-NN is a lazy learner: training merely stores the data, so this cost is paid at query time. The algorithm loops through every training observation, computes the distance between that observation and the new observation, and keeps the k closest.

Time is linear with respect to the number of instances (n) and dimensions (d).

Space Complexity = O(n*d)
K Nearest Neighbors stores the entire training data.
Testing takes longer because you have to compare every test instance to the whole training data.
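A brute-force sketch that makes this cost explicit (illustrative code; the function and variable names are my own):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # O(n*d): distance from x_new to every one of the n training points
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # O(n*log(n)) with a full sort; np.argpartition would reduce this
    nearest = np.argsort(dists)[:k]
    # majority vote among the k nearest labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]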

The complexity of Logistic Regression

For logistic regression, training time complexity means the time to solve the underlying optimization problem.
Train Time Complexity = O(n*d)

Space Complexity = O(d)
Note: Logistic regression is very good for low latency applications.
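Assuming a gradient-based solver, the O(n*d) term comes from each gradient evaluation: computing X·θ and then Xᵀ(p − y) touches every one of the n*d entries of the data matrix. A minimal sketch of one gradient step (illustrative, assuming NumPy arrays):

import numpy as np

def logistic_gradient_step(X, y, theta, alpha=0.1):
    # X: (n, d) data, y: (n,) labels in {0, 1}, theta: (d,) weights
    z = X @ theta                    # O(n*d)
    p = 1.0 / (1.0 + np.exp(-z))     # sigmoid, O(n)
    grad = X.T @ (p - y) / len(y)    # O(n*d)
    return theta - alpha * grad      # one O(d) update

At prediction time only θᵀx is needed, which is O(d) per example; this is why logistic regression suits low-latency applications.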

The complexity of SVM

Training Time Complexity = O(n²)
Note: if n is large, avoid using SVM.

Run-time Complexity = O(k*d)
k = number of support vectors, d = dimensionality of the data
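This run-time bound follows from the kernel SVM decision function,

f(x) = Σi αi * yi * K(si, x) + b

which needs one kernel evaluation, each costing O(d), for every one of the k support vectors si.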

The complexity of Decision Tree

Training Time Complexity = O(n*log(n)*d)
n = number of points in the training set
d = dimensionality of the data

Run-time Complexity = O(maximum depth of the tree)
Note: We use a Decision Tree when we have large data with low dimensionality.

The complexity of Random Forest

Training Time Complexity = O(n*log(n)*d*k)
k = number of Decision Trees
Note: When we have a large amount of data with a reasonable number of features, we can use multiple cores to parallelize the model and train the different Decision Trees independently.

Run-time Complexity = O(depth of tree * k)
Space Complexity = O(depth of tree * k)
Note: Random Forest is comparatively faster than other algorithms.

The complexity of Naive Bayes

Training Time Complexity = O(n*d)
Run-time Complexity = O(c*d)
We have to retrieve the feature likelihoods for each of the c classes.
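Prediction scores each class by combining its prior with d per-feature log-probabilities, which is where O(c*d) comes from. A minimal multinomial Naive Bayes scoring sketch (illustrative; the array names are my own):

import numpy as np

def nb_predict(log_priors, log_probs, x):
    # log_priors: (c,) log P(class); log_probs: (c, d) log P(feature | class)
    # x: (d,) feature counts
    scores = log_priors + log_probs @ x  # O(c*d): one length-d dot product per class
    return int(np.argmax(scores))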

In this last part I tried to provide the important information on Multivariate Calculus and Algorithms and Complexity. I hope you found something useful here. Thank you for reading till the end, and if you liked my blog, please hit the clap button below. Let me know if it was really useful. Also, if you did not check my other parts on the same topic, the links are below.

Part 1 —

Part 2 —
