
Mathematical Foundations of Machine Learning

December 18, 2023 · 20 min read

Linear Algebra · Calculus · Probability · Tutorial

Machine learning might seem like magic, but it's built on solid mathematical foundations. Understanding these concepts isn't just academic—it's the key to building better models, debugging problems, and pushing the field forward.

Why Mathematics Matters in ML

Before diving into specific topics, let's understand why mathematics is crucial:

  • Model Design: Math helps us understand what models can and cannot do
  • Optimization: We need calculus to train models efficiently
  • Uncertainty: Probability theory helps us handle noisy, incomplete data
  • Dimensionality: Linear algebra lets us work with high-dimensional data

Linear Algebra: The Language of Data

Linear algebra is everywhere in machine learning: it is how we represent data as vectors and matrices, and how we describe the transformations a neural network applies to them.

Vectors and Matrices

import numpy as np

# Data as vectors
x = np.array([1.5, 2.3, 0.8])  # A data point
X = np.array([[1.5, 2.3, 0.8],  # Dataset as matrix
              [2.1, 1.9, 1.2],
              [0.9, 3.1, 0.5]])

# Matrix operations
W = np.array([[0.5, 0.3],     # Weight matrix
              [0.2, 0.8],
              [0.1, 0.4]])

# Linear transformation
output = X @ W  # Matrix multiplication
print(f"Transformed data shape: {output.shape}")

Eigenvalues and Eigenvectors

These are crucial for understanding Principal Component Analysis (PCA):

# Compute the covariance matrix (np.cov treats rows as variables, so pass X.T)
cov_matrix = np.cov(X.T)

# Find eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# np.linalg.eig gives no ordering guarantee, so sort by eigenvalue
order = np.argsort(eigenvalues)[::-1]

print(f"Eigenvalues: {eigenvalues[order]}")
print(f"Principal component: {eigenvectors[:, order[0]]}")

Key Linear Algebra Concepts for ML

  1. Matrix Multiplication: Foundation of neural networks
  2. Rank and Span: Understanding data dimensionality
  3. Norms: Measuring distances and regularization
  4. Decompositions: SVD and eigendecomposition for dimensionality reduction (see the sketch below)
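
Items 2–4 are easy to see in code. Below is a minimal sketch using NumPy on the matrix X defined earlier; the vector w is just an arbitrary example for illustrating norms.

# Rank: how many independent directions the data actually spans
print(f"Rank of X: {np.linalg.matrix_rank(X)}")

# Norms: the L2 norm measures length/distance, the L1 norm underlies sparse regularization
w = np.array([0.5, -1.2, 0.3])
print(f"L2 norm of w: {np.linalg.norm(w):.3f}")
print(f"L1 norm of w: {np.linalg.norm(w, 1):.3f}")

# SVD: factor X into orthogonal directions and singular values;
# keeping only the largest singular values gives the best low-rank approximation
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_rank1 = S[0] * np.outer(U[:, 0], Vt[0, :])
print(f"Singular values: {S}")
print(f"Rank-1 reconstruction error: {np.linalg.norm(X - X_rank1):.3f}")

Truncating the SVD in this way is exactly the mechanism behind PCA and many other dimensionality-reduction methods.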

Calculus: The Engine of Optimization

Machine learning models learn by optimizing objective functions. Calculus makes this possible.

Gradients and Derivatives

import matplotlib.pyplot as plt

def quadratic_function(x):
    return x**2 + 2*x + 1

def gradient(x):
    return 2*x + 2

# Gradient descent example
x = 5.0  # Starting point
learning_rate = 0.1
history = [x]

for i in range(20):
    grad = gradient(x)
    x = x - learning_rate * grad
    history.append(x)
    
print(f"Final x: {x:.4f}")
print(f"Final function value: {quadratic_function(x):.4f}")

Chain Rule and Backpropagation

The chain rule is the mathematical foundation of backpropagation:

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_3} \cdot \frac{\partial a_3}{\partial a_2} \cdot \frac{\partial a_2}{\partial w_1}$$

# Simple example of chain rule in action
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Forward pass
x = 2.0
w = 0.5
z = w * x
a = sigmoid(z)

# Backward pass (chain rule)
dL_da = 2 * (a - 1)  # Gradient of the squared loss (a - 1)^2, i.e. target = 1
da_dz = sigmoid_derivative(z)
dz_dw = x

dL_dw = dL_da * da_dz * dz_dw
print(f"Gradient w.r.t. weight: {dL_dw:.4f}")

Multivariable Calculus

Real ML problems involve functions of many variables:

# Gradient of a multivariable function
def multivar_function(w):
    x, y = w[0], w[1]
    return x**2 + 2*y**2 + x*y

def compute_gradient(w):
    x, y = w[0], w[1]
    return np.array([2*x + y, 4*y + x])

# Gradient descent for multivariable function
w = np.array([3.0, 2.0])
learning_rate = 0.1

for i in range(50):
    grad = compute_gradient(w)
    w = w - learning_rate * grad

print(f"Optimized weights: {w}")

Probability: Handling Uncertainty

Machine learning deals with uncertainty everywhere—noisy data, model uncertainty, prediction confidence.

Probability Distributions

import scipy.stats as stats

# Normal distribution - ubiquitous in ML
mu, sigma = 0, 1
x = np.linspace(-4, 4, 100)
pdf = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.plot(x, pdf, 'b-', label=f'Normal(μ={mu}, σ={sigma})')
plt.title('Probability Density Function')
plt.legend()

# Bernoulli distribution - for binary classification
p = 0.7
outcomes = stats.bernoulli.rvs(p, size=1000)
plt.subplot(1, 2, 2)
plt.hist(outcomes, bins=2, density=True, alpha=0.7)
plt.title(f'Bernoulli Distribution (p={p})')
plt.show()

Bayes' Theorem

The foundation of Bayesian machine learning:

$$P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}$$

# Bayesian inference example
def bayesian_update(prior, likelihood, evidence):
    posterior = (likelihood * prior) / evidence
    return posterior

# Example: coin flip inference
prior_fair = 0.5  # Prior belief coin is fair
likelihood_10_heads = 0.5**10  # Likelihood of 10 heads if fair
likelihood_10_heads_biased = 0.9**10  # If biased towards heads

# Evidence (total probability)
evidence = likelihood_10_heads * prior_fair + likelihood_10_heads_biased * (1 - prior_fair)

posterior_fair = bayesian_update(prior_fair, likelihood_10_heads, evidence)
print(f"Posterior probability coin is fair: {posterior_fair:.4f}")

Maximum Likelihood Estimation

# MLE for normal distribution
def log_likelihood_normal(data, mu, sigma):
    n = len(data)
    ll = -n/2 * np.log(2 * np.pi * sigma**2)
    ll -= np.sum((data - mu)**2) / (2 * sigma**2)
    return ll

# Generate sample data
np.random.seed(42)
true_mu, true_sigma = 3.0, 1.5
data = np.random.normal(true_mu, true_sigma, 100)

# MLE estimates
mle_mu = np.mean(data)
mle_sigma = np.std(data, ddof=0)

print(f"True parameters: μ={true_mu}, σ={true_sigma}")
print(f"MLE estimates: μ={mle_mu:.3f}, σ={mle_sigma:.3f}")

Information Theory

Information theory provides tools for understanding data compression, feature selection, and model complexity.

Entropy

def entropy(probabilities):
    """Calculate Shannon entropy"""
    p = np.array(probabilities)
    p = p[p > 0]  # Remove zero probabilities
    return -np.sum(p * np.log2(p))

# Example: entropy of coin flips
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print(f"Fair coin entropy: {entropy(fair_coin):.3f} bits")
print(f"Biased coin entropy: {entropy(biased_coin):.3f} bits")

Mutual Information

from sklearn.metrics import mutual_info_score

# Generate correlated data
x = np.random.randint(0, 3, 1000)
y = (x + np.random.randint(0, 2, 1000)) % 3  # y depends on x

mi = mutual_info_score(x, y)
print(f"Mutual information between x and y: {mi:.3f}")

Putting It All Together

Here's how these mathematical concepts work together in a simple neural network:

class SimpleNeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Linear algebra: weight matrices
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b1 = np.zeros((1, hidden_size))
        self.b2 = np.zeros((1, output_size))
    
    def forward(self, X):
        # Linear algebra: matrix operations
        self.z1 = X @ self.W1 + self.b1
        self.a1 = sigmoid(self.z1)  # Calculus: derivative needed for backprop
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y, output):
        m = X.shape[0]
        
        # Calculus: chain rule for gradients
        dz2 = output - y
        dW2 = (1/m) * self.a1.T @ dz2
        db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
        
        da1 = dz2 @ self.W2.T
        dz1 = da1 * sigmoid_derivative(self.z1)
        dW1 = (1/m) * X.T @ dz1
        db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
        
        return dW1, db1, dW2, db2
    
    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            
            # Probability: binary cross-entropy loss (the negative log-likelihood);
            # its gradient with a sigmoid output is exactly (output - y), matching backward()
            eps = 1e-8
            loss = -np.mean(y * np.log(output + eps) + (1 - y) * np.log(1 - output + eps))
            
            # Backward pass
            dW1, db1, dW2, db2 = self.backward(X, y, output)
            
            # Calculus: gradient descent update
            self.W1 -= learning_rate * dW1
            self.b1 -= learning_rate * db1
            self.W2 -= learning_rate * dW2
            self.b2 -= learning_rate * db2
            
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Example usage
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int).reshape(-1, 1)

nn = SimpleNeuralNetwork(2, 4, 1)
nn.train(X, y, epochs=1000, learning_rate=0.1)
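
As a quick sanity check (a hypothetical follow-up, not part of the class above), threshold the sigmoid output at 0.5 and compute training accuracy:

# Evaluate the trained network on its training data
predictions = (nn.forward(X) > 0.5).astype(int)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy:.2%}")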

Conclusion

Mathematics isn't just the foundation of machine learning—it's the key to understanding, debugging, and improving your models. While libraries abstract away many details, understanding the underlying math helps you:

  • Choose appropriate models for your problems
  • Debug when things go wrong
  • Optimize performance and efficiency
  • Contribute to advancing the field

The journey to mathematical mastery is gradual, but every concept you understand makes you a better machine learning practitioner.


Next up: Check out my post on Bayesian Methods in Machine Learning to dive deeper into probabilistic approaches.
