July 29, 2025
Projection onto Hyperplane

Projections play an important role in linear algebra, optimization, and machine learning. Sometimes we are interested in projecting points onto a lower-dimensional space or, more generally, onto a subspace of a higher-dimensional space.

In this note, we explain how to derive the formula for the projection of a point onto a hyperplane. The derivation works the same way when the hyperplane is replaced by a plane or a line.

0. Notations

We consider vectors in $\mathbb{R}^n$ and use the inner product notation $(a, b) = a^T b$ for vectors $a, b \in \mathbb{R}^n$. The norm of a vector $a$ is denoted by $\|a\| = \sqrt{(a, a)}$. We denote by $0$ the zero vector of $\mathbb{R}^n$. We also use the notation $a \perp b$ to mean that $a$ and $b$ are orthogonal, i.e. $(a, b) = 0$.

1. Hyperplane

Let $a \in \mathbb{R}^n$ be a nonzero vector and $b \in \mathbb{R}$ a scalar. The set defined by $H = \{x \in \mathbb{R}^n : (a, x) = b\}$ is called a hyperplane of $\mathbb{R}^n$ with normal vector $a$ and offset $b$.

Additionally, the vector $a$ is orthogonal to $H$, in the sense that it is orthogonal to every direction lying in $H$: if $x, y \in H$, then $(a, x) = b$ and $(a, y) = b$, which implies that $(a, x - y) = 0$. More generally, for a vector $w$, we write $w \perp H$ to mean that $(w, x - y) = 0$ for all $x, y \in H$.

Fact 1: Any vector $w$ orthogonal to $H$ is a multiple of $a$, i.e. $w = \lambda a$ for some $\lambda \in \mathbb{R}$.

Proof: Let $V = \{x - y : x, y \in H\}$ denote the set of directions of $H$. We have just seen that $(a, v) = 0$ for all $v \in V$; conversely, if $(a, v) = 0$ and $x \in H$, then $x + v \in H$, so $V = \{v \in \mathbb{R}^n : (a, v) = 0\}$. Now assume $w \perp H$, i.e. $(w, v) = 0$ for all $v \in V$, and set $\lambda = (w, a)/\|a\|^2$. Then $(w - \lambda a, a) = (w, a) - \lambda \|a\|^2 = 0$, so $w - \lambda a \in V$. Since both $w$ and $\lambda a$ are orthogonal to every vector of $V$, we get

$$(w - \lambda a, w - \lambda a) = (w, w - \lambda a) - \lambda (a, w - \lambda a) = 0.$$

This means that $w - \lambda a$ has to be zero, i.e. $w = \lambda a$.

2. Projection onto $H$

The problem consists of projecting $w$ onto $H$, that is, finding $w_H$ such that $w_H \in H$ and $w - w_H$ is orthogonal to $H$.

Claim: The projection of $w$ onto $H$ is given by

$$w_H = w + \frac{b - (a, w)}{\|a\|^2}\, a.$$

Proof: We know the following:

  1. $w_H \in H$
  2. $w_H - w \perp H$, i.e. $w_H - w$ is orthogonal to $H$.

From Fact 1, we write $w_H - w = \lambda a$ for some $\lambda \in \mathbb{R}$, that is

$$w_H = w + \lambda a \tag{\#}$$

Because $w_H \in H$, we must have $(w_H, a) = b$, according to the definition of $H$. Using $(\#)$ above, we have

$$\begin{aligned} b &= (w_H, a) = (w + \lambda a, a) = (w, a) + \lambda (a, a)\\ &= (w, a) + \lambda \|a\|^2 \end{aligned}$$

Therefore,

$$\lambda = \frac{b - (w, a)}{\|a\|^2}. \tag{\#\#}$$

Inserting this into $(\#)$, we obtain

$$w_H = w + \frac{b - (w, a)}{\|a\|^2}\, a,$$

which proves our claim: this is the projection of $w$ onto the hyperplane $H$.
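
As a quick sanity check, here is a minimal NumPy sketch of this formula (the function name project_onto_hyperplane and the numerical example are our own illustration, not part of the derivation above):

import numpy as np

def project_onto_hyperplane(w, a, b):
    # Projection of w onto H = {x : (a, x) = b}, assuming a != 0
    lam = (b - np.dot(a, w)) / np.dot(a, a)
    return w + lam * a

# Example: projecting (0, 0) onto the line x1 + x2 = 2 gives (1, 1)
print(project_onto_hyperplane(np.array([0.0, 0.0]), np.array([1.0, 1.0]), 2.0))  # [1. 1.]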

3. Applications

As we have already mentioned, projections are widely used in both science and engineering. To illustrate this, we will consider some examples of algorithms and techniques that use (the idea of) orthogonal projections.

Our first examples are from the field of (Randomized) Linear Algebra and Optimization.

1. The Kaczmarz method

The Kaczmarz method is an iterative method used intensively for solving overdetermined systems of linear equations. The method, named after its initiator, the Polish mathematician Stefan Kaczmarz, has gained popularity in the optimization community because of its simplicity and effectiveness in solving large-scale linear systems, like those arising in machine learning (e.g. linear regression). For a broad literature review, see these papers 1, 2, and 3.

Consider the system of linear equations $Ax = b$ where $A \in \mathbb{R}^{m \times n}$ is a matrix with $m$ equations and $n$ variables, and $b \in \mathbb{R}^m$ is a vector of constants. That is, we have the system

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\ &\vdots \\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m \end{aligned}$$

We assume that the system is consistent, i.e. there exists $\overline{x}$ such that $A\overline{x} = b$, and that $m \gg n$. The (classic) Kaczmarz method iteratively projects an initial guess $x_0$ onto the hyperplanes $(a_i, x) = b_i$ for $i = 1, \ldots, m$. That is, at iteration $k$, the KM selects a row $a_i$ of $A$ and computes the projection of $x_k$ onto the hyperplane defined by that row, i.e. $H_{a_i} = \{x \mid (a_i, x) = b_i\}$, to obtain

$$x_{k+1} = x_k + \frac{b_i - (a_i, x_k)}{\|a_i\|^2}\, a_i$$

We recognize that this is the same projection formula derived above.

What the (classic) KM does is, at iteration $k$, project the current iterate $x_k$ onto the hyperplane defined by the $i$-th row of $A$, i.e. $H_i$, and repeat this process until a stopping condition is met.

The full algorithm is as follows:

Procedure Kaczmarz(A, b, x₀, max_iter)
Input: Matrix A ∈ ℝ^{m×n}, vector b ∈ ℝ^m, initial guess x₀ ∈ ℝ^n, maximum iterations max_iter
1. Set x = x₀
2. For k = 1 to max_iter do:
    a. Select i_k ∈ {1, ..., m} cyclically, i.e. i_k = (k mod m) + 1
    b. Let a_i be the i_k-th row of A
    c. Let b_i be the i_k-th entry of b
    d. Update:
        x ← x + [(b_i - ⟨a_i, x⟩) / ⟨a_i, a_i⟩] * a_i
3. Return x

For the Python implementation, we can use the following code snippet:

import numpy as np

def kaczmarz(A, b, x0, max_iter=1000):
    m, n = A.shape
    x = np.array(x0, dtype=float)  # work on a float copy of the initial guess
    for k in range(max_iter):
        i_k = k % m  # cyclically select a row
        a_i = A[i_k, :]
        b_i = b[i_k]
        # Project onto the hyperplane defined by a_i
        x += (b_i - np.dot(a_i, x)) / np.dot(a_i, a_i) * a_i
    return x

Analytical Example: Consider the consistent overdetermined system

$$A = \begin{bmatrix} 1 & 0\\ 1 & 2\\ 1 & -1 \end{bmatrix}, \qquad b = \begin{bmatrix} 1\\ 3\\ 0 \end{bmatrix},$$

whose exact solution is

$$\bar x = \begin{bmatrix} 1\\ 1 \end{bmatrix}$$

(since $x_1 = 1$, $x_1 + 2x_2 = 3 \Rightarrow x_2 = 1$, and $x_1 - x_2 = 0$).
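
A quick NumPy check (variable names are ours) confirms that $\bar x$ satisfies all three equations:

import numpy as np

A = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, -1.0]])
b = np.array([1.0, 3.0, 0.0])
print(A @ np.array([1.0, 1.0]))  # [1. 3. 0.], which equals b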

Kaczmarz update

Given row $a_i^\top$ and scalar $b_i$, one step is the orthogonal projection

$$x_{k+1} = x_k + \alpha\, a_i, \qquad \alpha = \frac{b_i - a_i^\top x_k}{\|a_i\|_2^2}.$$

This enforces $a_i^\top x_{k+1} = b_i$.

We start from $x_0 = \begin{bmatrix} -2\\ 2 \end{bmatrix}$ and use cyclic rows $i = 1, 2, 3, 1, 2, 3, \ldots$

Iterations (step-by-step arithmetic):

  1. $k=0$, row 1: $a_1 = (1, 0)$, $\|a_1\|^2 = 1$
    $\alpha = \dfrac{1 - (1, 0) \cdot (-2, 2)}{1} = 3$
    $x_1 = x_0 + 3\,(1, 0) = (1, 2)$

  2. $k=1$, row 2: $a_2 = (1, 2)$, $\|a_2\|^2 = 5$
    $a_2^\top x_1 = 1 + 2 \cdot 2 = 5$
    $\alpha = \dfrac{3 - 5}{5} = -0.4$
    $x_2 = x_1 - 0.4\,(1, 2) = (0.6, 1.2)$

  3. $k=2$, row 3: $a_3 = (1, -1)$, $\|a_3\|^2 = 2$
    $a_3^\top x_2 = 0.6 - 1.2 = -0.6$
    $\alpha = \dfrac{0 - (-0.6)}{2} = 0.3$
    $x_3 = x_2 + 0.3\,(1, -1) = (0.9, 0.9)$

  4. $k=3$, row 1:
    $\alpha = \dfrac{1 - (1, 0) \cdot (0.9, 0.9)}{1} = 0.1$
    $x_4 = (1.0, 0.9)$

  5. $k=4$, row 2:
    $a_2^\top x_4 = 1.0 + 2 \cdot 0.9 = 2.8$
    $\alpha = \dfrac{3 - 2.8}{5} = 0.04$
    $x_5 = (1.04, 0.98)$

  6. $k=5$, row 3:
    $a_3^\top x_5 = 1.04 - 0.98 = 0.06$
    $\alpha = \dfrac{0 - 0.06}{2} = -0.03$
    $x_6 = (1.01, 1.01)$
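
These hand computations can be reproduced with a short script (a minimal sketch; the variable names are ours):

import numpy as np

A = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, -1.0]])
b = np.array([1.0, 3.0, 0.0])
x = np.array([-2.0, 2.0])  # initial guess x0

for k in range(6):
    i = k % 3  # cyclic row selection: rows 1, 2, 3, 1, 2, 3
    a_i, b_i = A[i], b[i]
    x = x + (b_i - a_i @ x) / (a_i @ a_i) * a_i
    print(k + 1, x)  # x1=(1, 2), x2=(0.6, 1.2), x3=(0.9, 0.9), x4=(1.0, 0.9), x5=(1.04, 0.98), x6=(1.01, 1.01)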

Several papers have shown the convergence of the KM algorithm under various conditions. One of the issues with the (classical) KM is that it can be slow to converge, especially for large systems: it is shown in the literature that its convergence rate depends on the row order of the matrix $A$.

To address this issue, Strohmer et al. [1] introduced the Randomized Kaczmarz Method (RKM), which randomly selects a row of $A$ at each iteration. They showed that randomization helps to improve the convergence rate and robustness of the method, especially for large-scale problems.

The RKM algorithm is as follows:

Procedure Randomized Kaczmarz(A, b, x₀, max_iter)
Input: Matrix A ∈ ℝ^{m×n}, vector b ∈ ℝ^m, initial guess x₀ ∈ ℝ^n, maximum iterations max_iter  
1. Set x = x₀
2. For k = 1 to max_iter do:
    a. Randomly select i_k ∈ {1, ..., m} with probability P(i_k = i) = ||a_i||² / ||A||_F²
    b. Let a_i be the i_k-th row of A
    c. Let b_i be the i_k-th entry of b
    d. Update:
        x ← x + [(b_i - ⟨a_i, x⟩) / ⟨a_i, a_i⟩] * a_i
3. Return x

Its Python implementation is as follows:

import numpy as np

def randomized_kaczmarz(A, b, x0, max_iter=1000):
    m, n = A.shape
    x = np.array(x0, dtype=float)  # work on a float copy of the initial guess
    # Row-selection probabilities proportional to squared row norms: p_i = ||a_i||^2 / ||A||_F^2
    p = np.linalg.norm(A, axis=1)**2 / np.linalg.norm(A, ord='fro')**2
    for k in range(max_iter):
        i_k = np.random.choice(m, p=p)
        a_i = A[i_k, :]
        b_i = b[i_k]
        # Project onto the hyperplane defined by a_i
        x += (b_i - np.dot(a_i, x)) / np.dot(a_i, a_i) * a_i
    return x
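
As a quick usage sketch, on the small consistent system from the analytical example above, the randomized variant also lands very close to $\bar x = (1, 1)$ after a few hundred iterations (this reuses the import and function defined above; the seed is only for reproducibility):

np.random.seed(0)
A = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, -1.0]])
b = np.array([1.0, 3.0, 0.0])
print(randomized_kaczmarz(A, b, np.array([-2.0, 2.0]), max_iter=200))  # approximately [1. 1.]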

To illustrate the usefulness of projections, we are going to give a detailed proof of the main result of the paper by Strohmer et al. [1]. For that, we first need to make precise some notation and definitions.

2. Least Squares Regression
