--- title: "Gram-Schmidt Orthogonalization and Regression" author: "Michael Friendly" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Gram-Schmidt Orthogonalization and Regression} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r nomessages, echo = FALSE} knitr::opts_chunk$set( warning = FALSE, message = FALSE ) options(digits=4) par(mar=c(5,4,1,1)+.1) ``` This vignette illustrates the process of transforming a set of variables to a new set of uncorrelated (orthogonal) variables. It carries out the Gram-Schmidt process **directly** by successively projecting each successive variable on the previous ones and subtracting (taking residuals). This is equivalent by replacing each successive variable by its residuals from a least squares regression on the previous variables. When this method is used on the predictors in a regression problem, the resulting orthogonal variables have exactly the same `anova()` summary (based on "Type I", sequential sums of squares) as do original variables. ## Setup We use the `class` data set, but convert the character factor `sex` to a dummy (0/1) variable `male`. ```{r class1} library(matlib) data(class) class$male <- as.numeric(class$sex=="M") ``` For later use in regression, we create a variable `IQ` as a response variable ```{r class2} class <- transform(class, IQ = round(20 + height + 3*age -.1*weight -3*male + 10*rnorm(nrow(class)))) head(class) ``` Reorder the predictors we want, forming a numeric matrix, `X`. ```{r class3} X <- as.matrix(class[,c(3,4,2,5)]) head(X) ``` ## Orthogonalization by projections The Gram-Schmidt process treats the variables in a given order, according to the columns in `X`. We start with a new matrix `Z` consisting of `X[,1]`. Then, find a new variable `Z[,2]` orthogonal to `Z[,1]` by subtracting the projection of `X[,2]` on `Z[,1]`. ```{r} Z <- cbind(X[,1], 0, 0, 0) Z[,2] <- X[,2] - Proj(X[,2], Z[,1]) crossprod(Z[,1], Z[,2]) # verify orthogonality ``` Continue in the same way, subtracting the projections of `X[,3]` on the previous columns, and so forth ```{r} Z[,3] <- X[,3] - Proj(X[,3], Z[,1]) - Proj(X[,3], Z[,2]) Z[,4] <- X[,4] - Proj(X[,4], Z[,1]) - Proj(X[,4], Z[,2]) - Proj(X[,4], Z[,3]) ``` Note that if any column of `X` is a linear combination of the previous columns, the corresponding column of `Z` will be all zeros. These computations are similar to the following set of linear regressions: ```{r usinglm} z2 <- residuals(lm(X[,2] ~ X[,1]), type="response") z3 <- residuals(lm(X[,3] ~ X[,1:2]), type="response") z4 <- residuals(lm(X[,4] ~ X[,1:3]), type="response") ``` The columns of `Z` are now orthogonal, but not of unit length, ```{r ortho1} zapsmall(crossprod(Z)) # check orthogonality ``` We make standardize column to unit length, giving `Z` as an **orthonormal** matrix, such that $Z' Z = I$. ```{r ortho2} Z <- Z %*% diag(1 / len(Z)) # make each column unit length zapsmall(crossprod(Z)) # check orthonormal colnames(Z) <- colnames(X) ``` ### Relationship to QR factorization The QR method uses essentially the same process, factoring the matrix $\mathbf{X}$ as $\mathbf{X = Q R}$, where $\mathbf{Q}$ is the orthonormal matrix corresponding to `Z` and $\mathbf{R}$ is an upper triangular matrix. However, the signs of the columns of $\mathbf{Q}$ are arbitrary, and `QR()` returns `QR(X)$Q` with signs reversed, compared to `Z`. 
### Relationship to QR factorization

The QR method uses essentially the same process, factoring the matrix $\mathbf{X}$ as $\mathbf{X = Q R}$, where $\mathbf{Q}$ is the orthonormal matrix corresponding to `Z` and $\mathbf{R}$ is an upper triangular matrix. However, the signs of the columns of $\mathbf{Q}$ are arbitrary, and `QR()` returns `QR(X)$Q` with signs reversed, compared to `Z`.

```{r QR}
# same result as QR(X)$Q, but with signs reversed
head(Z, 5)
head(-QR(X)$Q, 5)
all.equal( unname(Z), -QR(X)$Q )
```

## Regression with X and Z

We carry out two regressions of `IQ` on the variables in `X` and in `Z`. These are equivalent, in the sense that

- The $R^2$ and MSE are the same in both models
- Residuals are the same
- The Type I tests given by `anova()` are the same.

```{r class2IQ}
class2 <- data.frame(Z, IQ=class$IQ)
```

Regression of IQ on the original variables in `X`:

```{r mod1, R.options=list(digits=5)}
mod1 <- lm(IQ ~ height + weight + age + male, data=class)
anova(mod1)
```

Regression of IQ on the orthogonalized variables in `Z`:

```{r mod2, R.options=list(digits=5)}
mod2 <- lm(IQ ~ height + weight + age + male, data=class2)
anova(mod2)
```

This illustrates that `anova()` tests for linear models are *sequential* tests. They test hypotheses about the extra contribution of each variable over and above all previous ones, in a given order. These usually do not make substantive sense, except in testing ordered ("hierarchical") models.
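As a quick check on the equivalences claimed above, we can compare the two fits directly. This is a minimal sketch using `mod1` and `mod2` from the chunks above: the residuals, $R^2$, and residual standard error (the square root of the MSE) agree, even though the individual coefficients differ.

```{r check-equiv}
all.equal(residuals(mod1), residuals(mod2))          # same residuals
c(summary(mod1)$r.squared, summary(mod2)$r.squared)  # same R^2
c(summary(mod1)$sigma,     summary(mod2)$sigma)      # same residual std. error
```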