Starting point of linear regression and why we prefer matrix algebra

Somewhat counterintuitively, in many cases, and especially in linear regression, calculations become simpler when approached through multi-dimensional matrix algebra rather than through one-dimensional, coordinate-wise methods. Let's explore the foundational concepts of linear regression to understand this phenomenon.

Firstly, the setup. Let us say we are given a data matrix $X\in\mathbb{R}^{n\times p}$ and a label vector $Y\in\mathbb{R}^n$. We have $Y=X\beta+\epsilon$, where $\beta$ and $\epsilon$ are unknowns. Further, let's say that each component of $\epsilon$ follows a normal distribution $N(0,\sigma^2)$, i.i.d. This will be our "model assumption" (it is our choice; we can relax or strengthen it as needed).
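To make this setup concrete, here is a minimal NumPy sketch of the model assumption; the dimensions $n$ and $p$, the true $\beta$, and $\sigma$ are hypothetical values chosen only for illustration. Later snippets in this post reuse these names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and parameters, chosen only for illustration
n, p = 100, 3
beta_true = np.array([2.0, -1.0, 0.5])   # unknown to the analyst in practice
sigma = 0.3

X = rng.normal(size=(n, p))              # data matrix, n x p
eps = rng.normal(0.0, sigma, size=n)     # i.i.d. N(0, sigma^2) noise
Y = X @ beta_true + eps                  # model assumption: Y = X beta + eps
```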

Now, our objective is to estimate the parameter $\beta$ based on the given data. One reasonable way to achieve this is by choosing $\beta$ in such a way that the sum of squares, defined as $\mathrm{RSS} = \lVert Y-X\beta\rVert^2$, is minimized.

In the one-dimensional case, the calculation can be described as follows. The RSS is expressed as $\sum_{i=1}^{n}(Y_i-\beta_0-X_i\beta_1)^2$, and note that this is a convex function of $(\beta_0,\beta_1)$, so we obtain $\hat{\beta}_0$ and $\hat{\beta}_1$ by setting the partial derivatives of $\mathrm{RSS}$ to zero.

We arrive at $\hat{\beta}_1=\frac{\sum_i(X_i-\bar X)(Y_i-\bar Y)}{\sum_i(X_i-\bar X)^2}$ and $\hat{\beta}_0=\bar Y-\hat{\beta}_1\bar X$. Additionally, we can define the residuals $\hat{\epsilon}_i = Y_i-(\hat{\beta}_0+\hat{\beta}_1 X_i)$ and take the variance estimate to be $\hat{\sigma}^2 = \frac 1{n-2}\sum_i \hat{\epsilon}_i^2$, where the $2$ in the denominator accounts for the degrees of freedom used by the two estimated coefficients. This is already cumbersome; for example, to conduct hypothesis tests involving $\hat\beta$ and $\hat y$, we may need the mean and variance of $\hat\beta_0$, $\hat\beta_1$, and $\hat\epsilon$. The randomness sits only in the numerators, which keeps some of these calculations feasible, but the complexity escalates when we try to confirm that $\hat\sigma^2$ is unbiased.
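As a sanity check of these closed-form expressions, here is a short sketch that computes $\hat\beta_0$, $\hat\beta_1$, and $\hat\sigma^2$ directly from the formulas above, on simulated 1D data with hypothetical true intercept, slope, and noise level.

```python
import numpy as np

rng1d = np.random.default_rng(1)
n1 = 200
x = rng1d.normal(size=n1)
y = 1.5 + 2.0 * x + rng1d.normal(0.0, 0.5, size=n1)   # hypothetical beta0=1.5, beta1=2.0, sigma=0.5

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

resid_1d = y - (beta0_hat + beta1_hat * x)            # residuals eps_hat_i
sigma2_hat_1d = np.sum(resid_1d ** 2) / (n1 - 2)      # n - 2 degrees of freedom
```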

In the multivariate scenario, the process becomes more streamlined, as we'll see. Employing the same trick of differentiation, we can deduce that $\hat \beta = (X^TX)^{-1}X^T Y\sim N(\beta,\sigma^2(X^TX)^{-1})$. Accordingly, we can define the residual vector $\hat\epsilon = Y-X\hat\beta$ and the variance estimate $\hat\sigma^2 = \lVert Y-X\hat\beta\rVert ^2/(n-p)$.
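These matrix formulas translate almost literally into code. A sketch, reusing the simulated $X$, $Y$, $n$, and $p$ from the first snippet (solving the normal equations with `np.linalg.solve` instead of forming an explicit inverse is just a mild numerical nicety):

```python
# beta_hat = (X^T X)^{-1} X^T Y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

eps_hat = Y - X @ beta_hat                    # residual vector
sigma2_hat = eps_hat @ eps_hat / (n - p)      # variance estimate with n - p degrees of freedom

# Estimated covariance of beta_hat: sigma^2 (X^T X)^{-1}, with sigma^2 replaced by its estimate
cov_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)
```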

To understand why $\hat\sigma^2$ is unbiased, we note that

$$(n-p)\hat{\sigma}^2=\lVert Y-X\hat\beta\rVert^2=\lVert Y-X(X^TX)^{-1}X^T Y\rVert^2$$

and $X(X^TX)^{-1}X^T$ is the orthogonal projection matrix onto $\mathrm{span}(X)$, because it is idempotent and symmetric. Thus, $I-X(X^TX)^{-1}X^T$ is the orthogonal projection matrix onto $\mathrm{span}(X)^\perp$. This means we can rewrite the formula above as

$$(n-p)\hat{\sigma}^2=\lVert UU^T Y\rVert^2=\lVert U^T Y\rVert^2$$

where $U\in\mathbb{R}^{n\times(n-p)}$ is a matrix whose columns form an orthonormal basis of $\mathrm{span}(X)^\perp$, so that $UU^T=I-X(X^TX)^{-1}X^T$ and $U^TU=I_{n-p}$.
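These projection facts are easy to verify numerically. A sketch, again reusing $X$, $Y$, $n$, and $p$ from earlier: build the hat matrix, check that it is idempotent and symmetric, and obtain $U$ from a full QR decomposition of $X$ (one of several ways to get an orthonormal basis of $\mathrm{span}(X)^\perp$).

```python
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X^T X)^{-1} X^T
M = np.eye(n) - H                       # projection onto span(X)^perp

assert np.allclose(H @ H, H)            # idempotent
assert np.allclose(H, H.T)              # symmetric

# Full QR: the last n - p columns of Q are an orthonormal basis of span(X)^perp
Q, _ = np.linalg.qr(X, mode="complete")
U = Q[:, p:]

assert np.allclose(U @ U.T, M)          # U U^T = I - H
assert np.allclose(np.linalg.norm(U @ U.T @ Y), np.linalg.norm(U.T @ Y))
```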

Now consider the following sequence of equalities.

$$\begin{aligned} E(\lVert U^T Y\rVert^2) &= E(Y^T U U^T Y) \\ &= E(\epsilon^T UU^T \epsilon) \\ &= E(\mathrm{Tr}[(U^T\epsilon)^T (U^T\epsilon)]) \\ &= E(\mathrm{Tr}[(U^T\epsilon)(U^T\epsilon)^T]) \\ &= \mathrm{Tr}[U^T E(\epsilon\epsilon^T)\, U] \\ &= \mathrm{Tr}[\sigma^2 I_{n-p}] = (n-p)\sigma^2 \end{aligned}$$

Here the second equality uses $U^TY=U^T\epsilon$ (since $U^TX=0$), the third holds because a scalar equals its own trace, the fourth uses the cyclic property of the trace, the fifth swaps expectation and trace by linearity, and the last line uses $E(\epsilon\epsilon^T)=\sigma^2 I_n$ together with $U^TU=I_{n-p}$. These steps establish our claim.
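A quick Monte Carlo sketch of the unbiasedness claim, reusing the hypothetical $X$, $\beta$, and $\sigma$ from the first snippet: the average of $\hat\sigma^2$ over many fresh noise draws should land close to $\sigma^2$.

```python
reps = 5000
estimates = np.empty(reps)
for r in range(reps):
    Y_r = X @ beta_true + rng.normal(0.0, sigma, size=n)   # fresh noise, same X
    b_r = np.linalg.solve(X.T @ X, X.T @ Y_r)
    estimates[r] = np.sum((Y_r - X @ b_r) ** 2) / (n - p)

print(estimates.mean(), sigma**2)   # the two numbers should be close
```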

To understand why $\hat\beta\perp\hat\epsilon$, we can rely on the fact that, for jointly normal random vectors, zero covariance implies independence. This essentially emerges from the structure of the multivariate normal density: when the covariance matrix is block diagonal, the density separates into distinct components (i.e. it factors out). Why such a 'miracle' happens specifically for the normal distribution isn't immediately clear to me.

The reasoning unfolds as follows: to demonstrate the independence of $\hat\beta$ and $\hat\epsilon$, it suffices to show that $U^T Y$ and $\hat\beta=(X^TX)^{-1}X^T Y$ are independent, because $\hat\epsilon = UU^TY$ (and hence $\hat\sigma^2 = \lVert U^TY\rVert^2/(n-p)$) is a function of $U^T Y$. Since both are linear transformations of the Gaussian vector $Y$, we only need their cross-covariance to vanish: $\mathrm{Cov}(\hat\beta,\, U^TY)=\sigma^2 (X^TX)^{-1}X^T U=0$, which holds because $U^TX=0$ by definition.
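The independence claim can be probed numerically in the same spirit (a sketch with the same simulated model as above): across repeated noise draws, the sample cross-covariance between each component of $\hat\beta$ and each residual should be approximately zero.

```python
reps = 5000
betas = np.empty((reps, p))
resids = np.empty((reps, n))
for r in range(reps):
    Y_r = X @ beta_true + rng.normal(0.0, sigma, size=n)
    betas[r] = np.linalg.solve(X.T @ X, X.T @ Y_r)
    resids[r] = Y_r - X @ betas[r]

# Sample cross-covariance matrix between beta_hat (p entries) and eps_hat (n entries)
cross_cov = (betas - betas.mean(0)).T @ (resids - resids.mean(0)) / (reps - 1)
print(np.abs(cross_cov).max())   # should be near zero
```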

With these insights, we can further delve into the statistical analysis of $\hat\beta$ and $\hat y$.

What makes the multivariate calculation simpler in our case? Firstly, the use of the trace was pivotal: directly computing the expectation of the quadratic form $\epsilon^T UU^T\epsilon$ is impractical, but inside a trace we can cycle the factors and essentially use $U^TU=I$. Secondly, the unique attribute of multivariate normal distributions, that zero covariance equates to independence, plays a critical role.

This is not to suggest that complex calculations are inferior to more succinct ones. In mathematics, the aim is often to glean as much intuitive understanding as possible from various approaches. For example, in the 1D case, $\hat{\beta}_1=\frac{\sum_i(X_i-\bar X)(Y_i-\bar Y)}{\sum_i(X_i-\bar X)^2}$ can be interpreted as the sample correlation between $X$ and $Y$, scaled by the ratio of their standard deviations, reflecting the relative strength of their relationship.
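This interpretation is a one-line check in code, assuming the `x`, `y`, and `beta1_hat` from the earlier 1D sketch are still around:

```python
r_xy = np.corrcoef(x, y)[0, 1]                                   # sample correlation
assert np.isclose(beta1_hat, r_xy * y.std(ddof=1) / x.std(ddof=1))   # beta1_hat = r * s_y / s_x
```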

We've derived some fundamental formulas for multiple regression, laying the groundwork for further analysis. This illustrates how matrix algebra can streamline the computational process in regression.