| Title: | A Grammar of Graphics Implementation of Biplots |
|---|---|
| Description: | A 'ggplot2' based implementation of biplots, giving a representation of a dataset in a two dimensional space accounting for the greatest variance, together with variable vectors showing how the data variables relate to this space. It provides a replacement for stats::biplot(), but with many enhancements to control the analysis and graphical display. It implements biplot and scree plot methods which can be used with the results of prcomp(), princomp(), FactoMineR::PCA(), ade4::dudi.pca() or MASS::lda() and can be customized using 'ggplot2' techniques. |
| Authors: | Vincent Q. Vu [aut] (ORCID: <https://orcid.org/0000-0002-4689-0497>), Michael Friendly [aut, cre] (ORCID: <https://orcid.org/0000-0002-3237-0941>), Aghasi Tavadyan [ctb] |
| Maintainer: | Michael Friendly <[email protected]> |
| License: | GPL-2 |
| Version: | 0.6.5 |
| Built: | 2026-05-14 07:24:57 UTC |
| Source: | https://github.com/friendly/ggbiplot |
This dataset gives rates of occurrence (per 100,000 people) of various serious crimes in each of the 50 U.S. states, originally from the United States Statistical Abstracts (1970). The data were analyzed by John Hartigan (1975) in his book Clustering Algorithms and were later reanalyzed by Friendly (1991).
    data(crime)
A data frame with 50 observations on the following 10 variables.
state: state name, a character vector
murder: a numeric vector
rape: a numeric vector
robbery: a numeric vector
assault: a numeric vector
burglary: a numeric vector
larceny: a numeric vector
auto: auto thefts, a numeric vector
st: state abbreviation, a character vector
region: region of the U.S., a factor with levels Northeast, South, North Central, West
The data are originally from the United States Statistical Abstracts (1970). This dataset also appears in the SAS/Stat Sample library, Getting Started Example for PROC PRINCOMP, https://support.sas.com/documentation/onlinedoc/stat/ex_code/131/princgs.html, from which the current copy was derived.
Friendly, M. (1991). SAS System for Statistical Graphics. SAS Institute.
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley and Sons.
    data(crime)
    library(ggplot2)
    crime.pca <- crime |>
      dplyr::select(where(is.numeric)) |>
      prcomp(scale. = TRUE)
    ggbiplot(crime.pca,
             labels = crime$st,
             circle = TRUE,
             varname.size = 4,
             varname.color = "red") +
      theme_minimal(base_size = 14)
Biplots are based on the Singular Value Decomposition, which for a data matrix $\mathbf{X}$ is
$$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T ,$$
but these are computed and returned in quite different forms by various PCA-like methods. This function provides a common interface, returning the components with standard names.
    get_SVD(pcobj)
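| Argument | Description |
|---|---|
| pcobj | an object returned by prcomp(), princomp(), FactoMineR::PCA(), ade4::dudi.pca(), or MASS::lda() |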
A list of four elements:
n: The sample size on which the analysis was based
U: Left singular vectors, giving observation scores
D: Vector of singular values, the diagonal elements of the matrix $\mathbf{D}$, which are also the square roots of the eigenvalues of $\mathbf{X}^T \mathbf{X}$
V: Right singular vectors, giving variable loadings
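How these components relate to what prcomp() stores can be sketched in base R. The following is an illustration only: it uses the built-in USArrests data rather than the package's crime data, and it calls svd() directly instead of get_SVD(), so the variable names are assumptions made for the example.

```r
# Sketch: compare a direct SVD with the components stored by prcomp()
X   <- scale(as.matrix(USArrests))   # centered, unit-variance columns
n   <- nrow(X)
sv  <- svd(X)                        # X = U D V'
pca <- prcomp(USArrests, scale. = TRUE)

# prcomp() stores standard deviations: singular values divided by sqrt(n - 1)
stopifnot(isTRUE(all.equal(sv$d / sqrt(n - 1), pca$sdev)))

# the loadings agree with V, up to the (arbitrary) sign of each column
stopifnot(isTRUE(all.equal(abs(sv$v), abs(unname(pca$rotation)))))

# squared singular values are the eigenvalues of X'X
stopifnot(isTRUE(all.equal(sv$d^2, eigen(crossprod(X))$values)))
```

The same identities are what allow a common interface to recover comparable components from differently scaled outputs.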
    data(crime)
    crime.pca <- crime |>
      dplyr::select(where(is.numeric)) |>
      prcomp(scale. = TRUE)
    crime.svd <- get_SVD(crime.pca)
    names(crime.svd)
    crime.svd$D
A biplot simultaneously displays information on the observations (as points) and the variables (as vectors) in a multidimensional dataset. The 2D biplot is typically based on the first two principal components of a dataset, giving a rank 2 approximation to the data. The “bi” in biplot refers to the fact that two sets of points (i.e., the rows and columns of the data matrix) are visualized by scalar products, not the fact that the display is usually two-dimensional.
The biplot method for principal component analysis was originally defined by Gabriel (1971, 1981). Gower & Hand (1996) give a more complete treatment. Greenacre (2010) is a practical user-oriented guide to biplots. Gower et al. (2011) is the most up to date exposition of biplot methodology.
This implementation handles the results of a principal components analysis using prcomp, princomp, PCA and dudi.pca; it also handles a discriminant analysis using lda.
    ggbiplot(
      pcobj,
      choices = 1:2,
      scale = 1,
      pc.biplot = TRUE,
      obs.scale = 1 - scale,
      var.scale = scale,
      var.factor = 1,
      groups = NULL,
      geom.ind = "point",
      geom.var = c("arrow", "text"),
      point.size = 1.5,
      ellipse = FALSE,
      ellipse.prob = 0.68,
      ellipse.linewidth = 1.3,
      ellipse.fill = TRUE,
      ellipse.alpha = 0.25,
      labels = NULL,
      labels.size = 3,
      alpha = 1,
      var.axes = TRUE,
      circle = FALSE,
      circle.prob = 0.68,
      varname.size = 3,
      varname.adjust = 1.25,
      varname.color = "black",
      varname.abbrev = FALSE,
      axis.title = "PC",
      clip = "on",
      ...
    )
| Argument | Description |
|---|---|
| pcobj | an object returned by prcomp(), princomp(), FactoMineR::PCA(), ade4::dudi.pca(), or MASS::lda() |
| choices | Which components to plot? An integer vector of length 2. |
| scale | Covariance biplot (scale = 1), form biplot (scale = 0). |
| pc.biplot | Logical, for compatibility with biplot.princomp() |
| obs.scale | Scale factor to apply to observations |
| var.scale | Scale factor to apply to variables |
| var.factor | Factor to be applied to variable vectors after scaling. This allows the variable vectors to be reflected (var.factor = -1) or expanded in length (var.factor > 1) for greater visibility. |
| groups | Optional factor variable indicating the groups that the observations belong to. If provided, the points will be colored according to groups, and this also allows data ellipses to be drawn when ellipse = TRUE. |
| geom.ind | Text specifying the geometry to be used for the observations. Default: "point" |
| geom.var | Text specifying the geometry to be used for the variables. Default: c("arrow", "text") |
| point.size | Size of observation points. |
| ellipse | Logical; draw a normal data ellipse for each group? |
| ellipse.prob | Coverage size of the data ellipse in Normal probability |
| ellipse.linewidth | Thickness of the line outlining the ellipses |
| ellipse.fill | Logical; should the ellipses be filled? |
| ellipse.alpha | Transparency value (0 - 1) for filled ellipses |
| labels | Optional vector of labels for the observations. Often, this will be specified as the row names of the dataset. |
| labels.size | Size of the text used for the point labels |
| alpha | Alpha transparency value for the points (0 = transparent, 1 = opaque) |
| var.axes | Logical; draw arrows for the variables? |
| circle | Logical; draw a correlation circle? (only applies when prcomp was called with scale. = TRUE and when var.scale = 1) |
| circle.prob | Size of the correlation circle |
| varname.size | Size of the text for variable names |
| varname.adjust | Adjustment factor for the placement of the variable names; values >= 1 place them farther from the arrow |
| varname.color | Color for the variable vectors and names |
| varname.abbrev | Logical; whether or not to abbreviate the variable names, using abbreviate() |
| axis.title | Character; the prefix used as the axis labels. Default: "PC" |
| clip | Should geoms be clipped at the axis limits? Default: "on" |
| ... | other arguments passed down |
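The ellipse.prob and circle.prob arguments are coverage probabilities. As a rough illustration of what such a probability means, the sketch below traces a data ellipse for two variables under the usual bivariate normal assumption, where the radius in Mahalanobis units is the square root of a chi-square quantile with 2 degrees of freedom. The helper ellipse_points() is hypothetical and written for this example; it is not claimed to be the package's internal code.

```r
# Sketch: outline of a bivariate normal data ellipse with given coverage
# ellipse_points() is a hypothetical helper, not part of ggbiplot
ellipse_points <- function(x, y, prob = 0.68, n = 100) {
  mu <- c(mean(x), mean(y))              # ellipse center
  S  <- cov(cbind(x, y))                 # sample covariance matrix
  r  <- sqrt(qchisq(prob, df = 2))       # Mahalanobis radius for this coverage
  theta  <- seq(0, 2 * pi, length.out = n)
  circle <- rbind(cos(theta), sin(theta))        # unit circle, 2 x n
  pts <- t(mu + r * t(chol(S)) %*% circle)       # map circle onto the ellipse
  colnames(pts) <- c("x", "y")
  pts
}

pts <- ellipse_points(iris$Sepal.Length, iris$Sepal.Width, prob = 0.68)
head(pts)
```

Every point returned lies at the same squared Mahalanobis distance, qchisq(prob, 2), from the mean, which is what gives the ellipse its nominal coverage for normal data.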
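The biplot is constructed by using the singular value decomposition (SVD) to obtain a low-rank
approximation to the data matrix $\mathbf{X}_{n \times p}$ (centered, and optionally scaled to unit variances)
whose $n$ rows are the observations
and whose $p$ columns are the variables.
Using the SVD, the matrix $\mathbf{X}$, of rank $r \le \min(n, p)$,
can be expressed exactly as
$$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T = \sum_{i=1}^{r} d_i \mathbf{u}_i \mathbf{v}_i^T ,$$
where
$\mathbf{U}$ is an $n \times r$ orthonormal matrix of observation scores; these are also the eigenvectors
of $\mathbf{X} \mathbf{X}^T$,
$\mathbf{D}$ is an $r \times r$ diagonal matrix of singular values, $d_1 \ge d_2 \ge \dots \ge d_r$,
$\mathbf{V}$ is a $p \times r$ orthonormal matrix of variable weights and also the eigenvectors
of $\mathbf{X}^T \mathbf{X}$.
Then, a rank 2 (or 3) PCA approximation $\widehat{\mathbf{X}}$ to the data matrix used in the biplot
can be obtained from the first 2 (or 3)
singular values $d_i$
and the corresponding $\mathbf{u}_i, \mathbf{v}_i$ as
$$\mathbf{X} \approx \widehat{\mathbf{X}} = d_1 \mathbf{u}_1 \mathbf{v}_1^T + d_2 \mathbf{u}_2 \mathbf{v}_2^T .$$
The variance of $\mathbf{X}$ accounted for by each term is $d_i^2$.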
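The rank 2 approximation described above can be checked numerically. This is a minimal sketch in base R, assuming the built-in USArrests data as a stand-in for any numeric data matrix; the object names are chosen for the example.

```r
# Sketch: rank-2 approximation of a scaled data matrix via the SVD
X  <- scale(as.matrix(USArrests))    # centered, unit-variance columns
sv <- svd(X)
k  <- 2

# keep the first k singular values and vectors
X_hat <- sv$u[, 1:k] %*% diag(sv$d[1:k]) %*% t(sv$v[, 1:k])

# share of the total sum of squares carried by each term
prop <- sv$d^2 / sum(sv$d^2)
round(cumsum(prop), 3)

# the squared error of the fit is the sum of the discarded d_i^2
stopifnot(isTRUE(all.equal(sum((X - X_hat)^2), sum(sv$d[-(1:k)]^2))))
```

The last check is the Eckart-Young property: no other rank 2 matrix has a smaller sum of squared residuals.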
The biplot is then obtained by overlaying two scatterplots that share a common set of axes and have a between-set scalar
product interpretation. Typically, the observations (rows of $\mathbf{X}$) are represented as points
and the variables (columns of $\mathbf{X}$) are represented as vectors from the origin.
The scale factor, $\alpha$, allows the variances of the components to be apportioned between the
row points and column vectors, with different interpretations, by representing the approximation
$\widehat{\mathbf{X}}$ as the product of two matrices,
$$\widehat{\mathbf{X}} = (\mathbf{U} \mathbf{D}^{\alpha}) (\mathbf{D}^{1-\alpha} \mathbf{V}^T) .$$
The choice $\alpha = 1$, assigning the singular values totally to the left factor,
gives a distance interpretation to the row display, and
$\alpha = 0$ gives a distance interpretation to the column display.
$\alpha = 1/2$ gives a symmetrically scaled biplot.
When the singular values are assigned totally to the left or to the right factor, the resultant coordinates are called principal coordinates, and the sum of squared coordinates on each dimension equals the corresponding squared singular value. The other matrix, to which no part of the singular values is assigned, contains the so-called standard coordinates, which have a sum of squared values equal to 1.0 on each dimension.
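The effect of the scale factor can be illustrated in base R. In this sketch (again assuming the built-in USArrests data, with a hypothetical helper biplot_coords() written only for the example), the singular values are split between row and column coordinates, and the scalar products of the two sets reproduce the same rank 2 fit for every choice of the exponent.

```r
# Sketch: apportioning singular values between row and column coordinates
X  <- scale(as.matrix(USArrests))
sv <- svd(X)
k  <- 2
U <- sv$u[, 1:k]; d <- sv$d[1:k]; V <- sv$v[, 1:k]

# biplot_coords() is a hypothetical helper, not part of ggbiplot
biplot_coords <- function(alpha) {
  list(rows = U %*% diag(d^alpha),         # observation points
       cols = V %*% diag(d^(1 - alpha)))   # variable vectors
}

X_hat <- U %*% diag(d) %*% t(V)
for (alpha in c(0, 0.5, 1)) {
  co <- biplot_coords(alpha)
  # scalar products between the two sets always give the rank-2 fit
  stopifnot(isTRUE(all.equal(co$rows %*% t(co$cols), X_hat)))
}

# alpha = 1: rows in principal coordinates, columns in standard coordinates
co <- biplot_coords(1)
colSums(co$rows^2)   # squared singular values
colSums(co$cols^2)   # all equal to 1
```

Only the relative lengths of points and vectors change with the exponent; the approximation to the data does not.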
Scales and legend
When the 'groups' argument is not NULL, the function uses that value to set the aesthetics for 'color', 'fill' and 'shape'.
If you override the defaults using scale_color_discrete, etc., you may find that duplicate legends are produced.
To avoid this, you need to have the same name for aesthetics to be merged in the legend, for example,
    scale_fill_discrete(name = 'Species') +
      scale_color_discrete(name = 'Species')
or,
    labs(fill = "Species", color = "Species")
A ggplot2 plot object of class c("gg", "ggplot")
Vincent Q. Vu, Michael Friendly
Gabriel, K. R. (1971). The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58, 453–467. doi:10.2307/2334381.
Gabriel, K. R. (1981). Biplot display of multivariate matrices for inspection of data and diagnosis. In V. Barnett (Ed.), Interpreting Multivariate Data. London: Wiley.
Greenacre, M. (2010). Biplots in Practice. BBVA Foundation, Bilbao, Spain. Available for free at https://www.fbbva.es/microsite/multivariate-statistics/.
Gower, J. C., & Hand, D. J. (1996). Biplots. Chapman & Hall.
Gower, J. C., Lubbe, S. G., & le Roux, N. J. (2011). Understanding Biplots. Wiley.
reflect, ggscreeplot;
biplot for the original stats package version;
fviz_pca_biplot for the factoextra package version.
    data(wine)
    library(ggplot2)
    wine.pca <- prcomp(wine, scale. = TRUE)
    ggbiplot(wine.pca,
             obs.scale = 1, var.scale = 1,
             varname.size = 4,
             groups = wine.class,
             ellipse = TRUE, circle = TRUE)

    # Easier interpretation if the axes are reflected
    wine.pca <- reflect(wine.pca)

    # Use direct labels rather than a legend
    means <- aggregate(cbind(PC1, PC2) ~ wine.class,
                       data = wine.pca$x,
                       FUN = mean)
    ggbiplot(wine.pca,
             obs.scale = 1, var.scale = 1,
             groups = wine.class,
             varname.size = 4,
             ellipse = TRUE, circle = TRUE) +
      geom_label(data = means,
                 aes(x = PC1, y = PC2, label = wine.class)) +
      theme(legend.position = 'none')

    data(iris)
    iris.pca <- prcomp(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                       data = iris,
                       scale. = TRUE)
    ggbiplot(iris.pca,
             obs.scale = 1, var.scale = 1,
             groups = iris$Species,
             point.size = 2,
             varname.size = 5,
             varname.color = "black",
             varname.adjust = 1.2,
             ellipse = TRUE, circle = TRUE) +
      labs(fill = "Species", color = "Species") +
      theme_minimal(base_size = 14) +
      theme(legend.direction = 'horizontal', legend.position = 'top')
Produces scree plots (Cattell, 1966) of the variance proportions explained by each dimension against dimension number from various PCA-like dimension reduction techniques.
    ggscreeplot(
      pcobj,
      type = c("pev", "cev"),
      size = 4,
      shape = 19,
      color = "black",
      linetype = 1,
      linewidth = 1
    )
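| Argument | Description |
|---|---|
| pcobj | an object returned by prcomp(), princomp(), FactoMineR::PCA(), ade4::dudi.pca(), or MASS::lda() |
| type | the type of scree plot, one of "pev" or "cev" |
| size | point size |
| shape | shape of the points. Default: 19, a filled circle. |
| color | color for points and line. Default: "black" |
| linetype | type of line |
| linewidth | width of line |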
A ggplot2 object with the aesthetics x = PC, y = yvar
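The quantities shown on the vertical axis can be computed by hand from a prcomp() result. The sketch below assumes that "pev" stands for the proportion of explained variance and "cev" for its cumulative sum, which is how the description and the example comments read; it uses the built-in USArrests data and base R only.

```r
# Sketch: proportion (pev) and cumulative proportion (cev) of variance
pca <- prcomp(USArrests, scale. = TRUE)
eig <- pca$sdev^2                 # variances of the components
pev <- eig / sum(eig)             # proportion explained by each component
cev <- cumsum(pev)                # cumulative proportion

data.frame(PC = seq_along(pev), pev = pev, cev = cev)

# smallest number of components reaching 80% of the total variance
which(cev >= 0.8)[1]
```

A horizontal reference line at 0.8 or 0.9 on a cumulative plot marks the same cutoff visually.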
Cattell, R. B. (1966). The Scree Test For The Number Of Factors. Multivariate Behavioral Research, 1, 245–276.
    library(ggplot2)
    data(wine)
    wine.pca <- prcomp(wine, scale. = TRUE)
    ggscreeplot(wine.pca)

    # show horizontal lines for 80, 90% of cumulative variance
    ggscreeplot(wine.pca, type = "cev") +
      geom_hline(yintercept = c(0.8, 0.9), color = "blue")

    # Make a fancy screeplot, highlighting the scree starting at component 4
    data(crime)
    crime.pca <- crime |>
      dplyr::select(where(is.numeric)) |>
      prcomp(scale. = TRUE)
    (crime.eig <- crime.pca |>
       broom::tidy(matrix = "eigenvalues"))

    ggscreeplot(crime.pca) +
      stat_smooth(data = crime.eig |> dplyr::filter(PC >= 4),
                  aes(x = PC, y = percent),
                  method = "lm",
                  se = FALSE,
                  fullrange = TRUE)
Principal component-like objects have variable loadings (the eigenvectors of the covariance/correlation matrix) whose signs are arbitrary, in the sense that a given column can be reflected (multiplied by -1) without changing the fit.
    reflect(pcobj, columns = 1:2)
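| Argument | Description |
|---|---|
| pcobj | a PCA-like object, as returned by prcomp(), princomp(), FactoMineR::PCA(), ade4::dudi.pca(), or MASS::lda() |
| columns | a vector of indices of the columns to reflect |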
This function allows one to reflect any columns of the variable loadings (and corresponding observation scores). Coordinates for quantitative supplementary variables are also reflected if present. This is often useful for interpreting a biplot, for example when a component (often the first) has all negative signs.
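Why reflection leaves the fit unchanged can be seen with a few lines of base R. The helper flip_signs() below is hypothetical and handles only prcomp() results; it is a sketch of the idea, not the package's reflect(), and it uses the built-in USArrests data.

```r
# Sketch: flipping the sign of loadings and scores together keeps the fit
pca <- prcomp(USArrests, scale. = TRUE)

# flip_signs() is a hypothetical helper for prcomp objects only
flip_signs <- function(p, columns = 1:2) {
  p$rotation[, columns] <- -p$rotation[, columns]   # variable loadings
  p$x[, columns]        <- -p$x[, columns]          # observation scores
  p
}

pca_r <- flip_signs(pca, columns = 1:2)

# scores %*% t(loadings) reconstructs the scaled data either way
stopifnot(isTRUE(all.equal(pca$x %*% t(pca$rotation),
                           pca_r$x %*% t(pca_r$rotation))))

# only the signs of the selected columns differ
head(cbind(before = pca$x[, 1], after = pca_r$x[, 1]))
```

Because each reflected column is negated in both factors, the two sign changes cancel in every scalar product.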
The pca-like object with specified columns of the variable loadings and observation scores multiplied by -1.
Michael Friendly
    data(crime)
    crime.pca <- crime |>
      dplyr::select(where(is.numeric)) |>
      prcomp(scale. = TRUE)

    biplot(crime.pca)
    crime.pca <- reflect(crime.pca)  # reflect columns 1:2
    biplot(crime.pca)

    iris.lda <- MASS::lda(Species ~ ., data = iris)
    # reflect the first dimension
    iris.lda1 <- reflect(iris.lda, columns = 1)
    # compare predicted scores
    predict(iris.lda)$x |> head()
    predict(iris.lda1)$x |> head()
Results of a chemical analysis of wines grown in the same region in Italy, derived from three different cultivars. The analysis determined the quantities of 13 chemical constituents found in each of the three types of wines.
The grape varieties (cultivars), 'barolo', 'barbera', and 'grignolino', are indicated in wine.class.
    data(wine)
A data frame, wine, consisting of 178 observations (rows) and 13 columns, together with a factor vector, wine.class, indicating the cultivars.
The variables are:
Alcohol: a numeric vector
MalicAcid: Malic acid, a numeric vector
Ash: Ash, a numeric vector
AlcAsh: Alcalinity of ash, a numeric vector
Mg: Magnesium, a numeric vector
Phenols: total phenols, a numeric vector
Flav: Flavanoids, a numeric vector
NonFlavPhenols: Nonflavanoid phenols, a numeric vector
Proa: Proanthocyanins, a numeric vector
Color: Color intensity, a numeric vector
Hue: a numeric vector
OD: OD280/OD315 of diluted wines, a numeric vector
Proline: a numeric vector
UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Wine)
    data(wine)
    table(wine.class)

    wine.pca <- prcomp(wine, scale. = TRUE)
    ggscreeplot(wine.pca)

    ggbiplot(wine.pca,
             obs.scale = 1, var.scale = 1,
             groups = wine.class,
             ellipse = TRUE, circle = TRUE)