Formulation

Let $\mathcal{X} \subseteq \mathbb{R}^D$ and $\mathcal{Y} \subseteq \mathbb{R}$ be the (input) feature and target space respectively. A linear regression model takes the form

$$y = \mathbf{w}^\top \mathbf{x} + \epsilon,$$

where $\mathbf{x} \in \mathcal{X}$ is an input, $y \in \mathcal{Y}$ is the target, $\mathbf{w} \in \mathbb{R}^D$ is the weight (parameter) vector, and $\epsilon$ is an error term. (A bias term can be absorbed into $\mathbf{w}$ by appending a constant $1$ to $\mathbf{x}$.)

Linear regression can also be formulated in matrix form. Suppose there are $N$ instances in total, i.e., $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathcal{X}$ and $y_1, \dots, y_N \in \mathcal{Y}$. Then

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon},$$

where $\mathbf{y} = (y_1, \dots, y_N)^\top$, $\mathbf{X} \in \mathbb{R}^{N \times D}$ is the design matrix whose $n$-th row is $\mathbf{x}_n^\top$, and $\boldsymbol{\epsilon} = (\epsilon_1, \dots, \epsilon_N)^\top$.

The values $\epsilon_n$ are called error terms, or sometimes noise; they capture all factors that influence $y$ other than $\mathbf{x}$. It is common to assume that the noise follows a zero-mean Gaussian distribution:

$$\epsilon \sim \mathcal{N}(0, \sigma^2).$$

Then, the likelihood function can be written as

$$p(y \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2),$$

or equivalently

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma^2)$$

over the full dataset, assuming the data points are drawn independently from the distribution.
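As a concrete sketch, the i.i.d. Gaussian likelihood can be evaluated numerically. Everything below (the data, `w_true`, `sigma`) is synthetic and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = Xw + eps with Gaussian noise (hypothetical example values).
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
sigma = 0.3
y = X @ w_true + rng.normal(scale=sigma, size=N)

def log_likelihood(w, X, y, sigma):
    """Log of prod_n N(y_n | w^T x_n, sigma^2), assuming i.i.d. samples."""
    resid = y - X @ w
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

# The true weights should score higher than a heavily perturbed guess.
print(log_likelihood(w_true, X, y, sigma) > log_likelihood(w_true + 1.0, X, y, sigma))
```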

Basis Functions

The linear regression model assumes a linear relationship between $\mathbf{x}$ and $y$. However, in some cases the relationship may be nonlinear. To accommodate such cases, it is possible to apply a nonlinear transformation $\phi$ (called a basis function) to the input $\mathbf{x}$, transforming the input into some other form $\boldsymbol{\phi}(\mathbf{x})$, so that the model becomes $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$. As long as the parameters of $\phi$ are fixed, the model remains linear in the parameters, even if it is not linear in the inputs. Common choices of basis functions include:

  • Polynomial: $\phi_j(x) = x^j$
  • Gaussian: $\phi_j(x) = \exp\left(-\dfrac{(x - \mu_j)^2}{2s^2}\right)$
  • Sigmoidal: $\phi_j(x) = \sigma\left(\dfrac{x - \mu_j}{s}\right)$, where $\sigma(a) = \dfrac{1}{1 + e^{-a}}$
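A minimal sketch of these three basis families in code, for a one-dimensional input; the centres and width below are hypothetical choices, not prescribed values:

```python
import numpy as np

def polynomial_basis(x, degree):
    # phi_j(x) = x^j for j = 0..degree (j = 0 yields the bias column)
    return np.stack([x**j for j in range(degree + 1)], axis=1)

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

def sigmoid_basis(x, centers, s):
    # phi_j(x) = 1 / (1 + exp(-(x - mu_j) / s))
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(-1.0, 1.0, 5)
print(polynomial_basis(x, 2).shape)                    # (5, 3)
print(gaussian_basis(x, np.array([0.0, 0.5]), 0.2).shape)  # (5, 2)
```

Each function maps $N$ scalar inputs to an $N \times M$ feature matrix, after which the usual linear machinery applies unchanged.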

Parameter Estimation

Analytic

Maximum Likelihood

To maximise the likelihood, we can equivalently minimise the negative log likelihood:

$$-\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(y_n - \mathbf{w}^\top \mathbf{x}_n\right)^2 + \frac{N}{2} \log\left(2\pi\sigma^2\right).$$

The MLE is the point where $\nabla_{\mathbf{w}} \left(-\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\right) = \mathbf{0}$.

Solving the equation yields

$$\mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{y},$$

and therefore,

$$\hat{\mathbf{w}}_{\mathrm{MLE}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}.$$

This is also called the normal equation.
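The closed-form solution can be checked numerically. The sketch below uses synthetic data, and solves the normal equation with a linear solve rather than an explicit inverse, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (w_true and the noise scale are illustrative).
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# Solve X^T X w = X^T y directly instead of forming (X^T X)^{-1}.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_hat, w_true, atol=0.1))
```

With enough samples and small noise, the estimate recovers the generating weights up to the noise level.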

Mean Squared Error

Alternatively, we can define a loss function and find the optimal point that minimises the loss. A common choice is the mean squared error (MSE), which is given as follows:

$$\mathcal{L}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left(y_n - \mathbf{w}^\top \mathbf{x}_n\right)^2.$$

Now it suffices to find the point $\hat{\mathbf{w}}$ such that $\nabla_{\mathbf{w}} \mathcal{L}(\hat{\mathbf{w}}) = \mathbf{0}$.

Notice that, up to a constant factor, the equation is the same as in the MLE scenario. The optimal parameters are therefore given as:

$$\hat{\mathbf{w}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}.$$
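A short sketch illustrating that the MSE minimiser coincides with the least-squares solution; the data is synthetic, and `np.linalg.lstsq` serves as an independent fit for comparison:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data for illustration only.
N, D = 100, 2
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -3.0]) + rng.normal(scale=0.2, size=N)

def mse(w, X, y):
    # L(w) = (1/N) * sum_n (y_n - w^T x_n)^2
    resid = y - X @ w
    return resid @ resid / len(y)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)        # normal-equation solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least-squares fit

print(np.allclose(w_mle, w_lstsq))                  # same minimiser
print(mse(w_mle, X, y) <= mse(w_mle + 0.5, X, y))   # optimum has lower loss
```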

Numerical

Solving the normal equation involves inverting (or factorising) a matrix of size $D \times D$, which costs $O(D^3)$ in general. This can be very expensive if $D$, the input dimension, becomes large. In such cases, finding the optimal solution numerically, for example with gradient descent, can be an alternative.
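One common numerical approach is batch gradient descent on the MSE. The sketch below uses synthetic data; the step size and iteration count are illustrative, not tuned recommendations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression problem for illustration.
N, D = 500, 4
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + rng.normal(scale=0.05, size=N)

w = np.zeros(D)
lr = 0.1  # illustrative step size
for _ in range(500):
    grad = (2.0 / N) * X.T @ (X @ w - y)  # gradient of the MSE w.r.t. w
    w -= lr * grad

print(np.allclose(w, w_true, atol=0.05))
```

Each iteration costs $O(ND)$, so no $D \times D$ matrix is ever inverted, which is the point of the numerical route.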
