| Title: | Datasets from Computer Age Statistical Inference |
|---|---|
| Description: | Provides the datasets from Efron & Hastie (2016, ISBN: 9781108107952), "Computer Age Statistical Inference: Algorithms, Evidence, and Data Science", in an accessible R format for those who want to use them for study or to try to reproduce analyses from the book. |
| Authors: | Michael Friendly [aut, cre] |
| Maintainer: | Michael Friendly <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.2.1 |
| Built: | 2026-05-11 05:47:21 UTC |
| Source: | https://github.com/friendly/CASIdata |
Data on amyotrophic lateral sclerosis (Lou Gehrig's disease) from Section 17.2. There are 1822 observations on individuals with ALS. The goal is to predict the rate of progression dFRS of a functional rating score, using 369 predictors based on measurements (and derivatives of these) obtained from patient visits.
A data frame with 1822 rows and 371 variables. The key variables are testset (logical indicator for training/test split) and dFRS (response: rate of progression of the ALS functional rating score). The 369 predictor variables include:

- Demographics: Age, Sex.Male, Sex.Female, and race indicators (Race...Caucasian, Race...Asian, etc.)
- Family history of neurological diseases in relatives (e.g., Father, Mother, Brother, Sister)
- Neurological disease indicators (e.g., Neurological.Disease.ALS, Neurological.Disease.PARKINSON.S.DISEASE)
- Site of onset (Site.of.Onset.Onset..Bulbar, Site.of.Onset.Onset..Limb)
- Symptoms (Symptom.Atrophy, Symptom.Cramps, Symptom.Fasciculations, Symptom.Speech, etc.)
- Study arm indicators (Study.Arm.ACTIVE, Study.Arm.PLACEBO)
- Clinical measurements with summary statistics (first, last, min, max, mean, sd, slope): ALSFRS scores, blood pressure, forced/slow vital capacity (fvc.liters, svc.liters), respiratory rate, weight, height
- ALSFRS subscale items: climbing.stairs, cutting, dressing, handwriting, salivation, speech, swallowing, turning, walking
These data were kindly provided by Lester Mackey and Lilly Fang, who won the DREAM challenge prediction prize in 2012 (Kuffner et al., 2015). The dataset includes some additional variables that they created. Their winning entry used Bayesian trees, which are not too different from random forests.
https://hastie.su.domains/CASI_files/DATA/ALS.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Section 17.2.
data(als)
str(als)
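The testset indicator described above can be used to separate the rows into training and test parts. The following is a minimal sketch, assuming that `testset == TRUE` marks the test rows (the documentation only says it is a logical training/test indicator). The helper `split_by_testset()` and the toy data frame are hypothetical and only illustrate the idea; the toy values are made up and do not come from `als`.

```r
# Split a data frame into training and test parts using a logical `testset` column.
# Assumption: TRUE marks a test-set row.
split_by_testset <- function(df) {
  stopifnot(is.logical(df$testset))
  keep <- setdiff(names(df), "testset")   # drop the indicator itself
  list(
    train = df[!df$testset, keep, drop = FALSE],
    test  = df[df$testset,  keep, drop = FALSE]
  )
}

# Toy stand-in using documented column names (values are made up)
toy <- data.frame(
  testset = c(FALSE, FALSE, TRUE, FALSE, TRUE),
  dFRS    = c(-0.5, -1.2, -0.3, -0.8, -1.0),
  Age     = c(55, 62, 48, 70, 59)
)
parts <- split_by_testset(toy)
nrow(parts$train)  # 3
nrow(parts$test)   # 2

# With the package data (assumes the package is installed as CASIdata):
# data(als, package = "CASIdata")
# parts <- split_by_testset(als)
```

A model would then be fit on `parts$train` with dFRS as the response and evaluated on `parts$test`.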
Batting averages for 18 Major League players in the 1970 season, from Table 7.1. This dataset illustrates empirical Bayes estimation, where early-season performance is used to predict full-season batting averages.
A data frame with 18 rows and 3 variables:
Player ID number
Batting average based on the first 90 at-bats of the season
Batting average for the remainder of the 1970 season
https://hastie.su.domains/CASI_files/DATA/baseball.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Table 7.1.
data(baseball)
str(baseball)
40 points generated from a bivariate normal distribution, with some entries missing. From Figure 9.3.
A data frame with 40 rows and 2 variables:
First variable
Second variable
https://hastie.su.domains/CASI_files/DATA/bivnorm.csv
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Figure 9.3.
data(bivnorm)
str(bivnorm)
Number of butterfly species seen a given number of times each in two years of trapping. From Table 6.2. This is a frequency data frame.
A data frame with 24 rows and 2 variables:
Number of times a species was trapped
Number of species seen exactly k times (e.g., 118 species trapped just once, 74 trapped twice each)
https://hastie.su.domains/CASI_files/DATA/butterfly.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Table 6.2.
data(butterfly)
str(butterfly)
Human cell colonies infused with mouse nuclei in 5 different ratios over 1 to 5 days. From Table 8.2.
A data frame with 25 rows and 4 variables:
Number of cells that thrived
Colony size (number of cells)
Ratio of mouse nuclei to human cells (1-5)
Day of observation (1-5)
https://hastie.su.domains/CASI_files/DATA/cellinfusion.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Table 8.2.
data(cellinfusion)
str(cellinfusion)
Cholestyramine, a proposed cholesterol lowering drug, was administered to 164 men for an average of seven years each. From Figure 20.1.
A data frame with 164 rows and 2 variables:
Fraction of intended dose actually taken (standardized)
Decrease in cholesterol level over the course of the experiment
https://hastie.su.domains/CASI_files/DATA/cholesterol.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Figure 20.1.
data(cholesterol)
str(cholesterol)
Data from 442 diabetes patients used in Section 7.3. The response is a quantitative measure of disease progression one year after baseline. There are ten baseline predictors: age, sex, body-mass index, average blood pressure, and six blood serum measurements.
A data frame with 442 rows and 12 variables:
Row index
Age of patient
Sex of patient
Body mass index
Average blood pressure (mean arterial pressure)
Total cholesterol (serum measurement)
Low-density lipoproteins (serum measurement)
High-density lipoproteins (serum measurement)
Total cholesterol / HDL (serum measurement)
Log of triglycerides (serum measurement)
Blood sugar level (serum measurement)
Response: quantitative measure of disease progression
First used in the LARS paper (Efron et al., 2004).
Note: In Table 7.2, the centered predictor variables were standardized to unit L2 norm. In Table 20.1 they were standardized to unit variance.
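The two standardizations mentioned in this note differ only by a constant factor. The sketch below illustrates both on a small simulated matrix; the function names `scale_unit_l2()` and `scale_unit_var()` are hypothetical, and the unit-variance version uses the usual n - 1 sample variance, which is an assumption about the convention used in the book.

```r
# Center each column, then scale it to unit L2 norm (sum of squares equal to 1)
scale_unit_l2 <- function(x) {
  x <- sweep(x, 2, colMeans(x))              # center columns
  sweep(x, 2, sqrt(colSums(x^2)), "/")       # divide by column L2 norms
}

# Center each column, then scale it to unit sample variance (n - 1 denominator)
scale_unit_var <- function(x) {
  scale(x, center = TRUE, scale = TRUE)
}

# Simulated predictors for illustration only (not the diabetes data)
set.seed(1)
x <- matrix(rnorm(40), nrow = 10, ncol = 4)

x_l2  <- scale_unit_l2(x)
x_var <- scale_unit_var(x)

colSums(x_l2^2)        # each column has sum of squares 1, up to rounding
apply(x_var, 2, var)   # each column has variance 1, up to rounding

# The two versions differ by the factor sqrt(n - 1)
all.equal(x_l2 * sqrt(nrow(x) - 1), x_var, check.attributes = FALSE)
```

Either function could be applied to the numeric predictor columns of `diabetes` after converting them to a matrix.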
https://hastie.su.domains/CASI_files/DATA/diabetes.csv
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle Regression. Annals of Statistics, 32(2), 407-499.
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Section 7.3.
data(diabetes)
str(diabetes)
Data from 11 groups of mice (10 each) exposed to drug Xilathon at different doses. From Figure 8.2.
A data frame with 11 rows and 2 variables:
Log dose level (each step is a doubling)
Proportion of mice that died at that dose
https://hastie.su.domains/CASI_files/DATA/doseresponse.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Figure 8.2.
data(doseresponse)
str(doseresponse)
Diffusion Tensor Imaging (DTI) data comparing 6 dyslexic children with 6 normal controls, from Figures 15.9 and 15.10. Z scores were computed at 15,443 three-dimensional brain coordinates (voxels).
A data frame with 15443 rows and 4 variables:
Voxel coordinate: back to front
Voxel coordinate: left to right
Voxel coordinate: bottom to top
Z score comparing dyslexic vs normal controls at this voxel
https://hastie.su.domains/CASI_files/DATA/DTI.csv
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Figures 15.9 and 15.10.
data(DTI)
str(DTI)
Counts of galaxies binned by redshift and magnitude, from Table 8.5. The data have been reshaped into long format with variables for magnitude, redshift category, and frequency count.
A data frame with 270 rows and 3 variables:
Magnitude category (1-18)
Redshift category (1-15)
Number of galaxies in this bin
https://hastie.su.domains/CASI_files/DATA/galaxy.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Table 8.5.
data(galaxy)
str(galaxy)

library(car)

## Fit a main effects Poisson GLM
# This treats `mag` and `red` as numeric
galaxy.mod0 <- glm(freq ~ mag + red, data = galaxy, family = poisson)
Anova(galaxy.mod0)

## Fit response surface model
galaxy.mod1 <- glm(freq ~ poly(mag, 2) + poly(red, 2) + mag:red,
                   data = galaxy, family = poisson)
Anova(galaxy.mod1)
summary(galaxy.mod1)
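Because the counts are stored in long format, it can be useful to recover the 18 x 15 magnitude-by-redshift layout of the original table. A minimal sketch using `xtabs()` is shown below on a toy stand-in with the same column names (`mag`, `red`, `freq`) used in the example above; the toy counts are made up and are not the galaxy data.

```r
# Toy stand-in in the same long format: one row per (mag, red) bin
toy <- expand.grid(mag = 1:18, red = 1:15)
toy$freq <- (toy$mag + toy$red) %% 5   # made-up counts for illustration

# Cross-tabulate the frequencies back into a mag x red table
tab <- xtabs(freq ~ mag + red, data = toy)
dim(tab)   # 18 15

# With the package data, the same call would be:
# tab <- xtabs(freq ~ mag + red, data = galaxy)
```

The resulting table has one row per magnitude category and one column per redshift category.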
Genotype data for 197 US individuals from 4 racial groups (African American, European, Japanese, and African) at 100 SNP locations. From Section 13.5.
A data frame with 197 rows and 102 variables. The first column X is a row index, race is the racial/ethnic group identifier, and the remaining 100 columns (Snp1 through Snp100) contain genotype values (0, 1, or 2) at each SNP location, with NA for missing values.
https://hastie.su.domains/CASI_files/DATA/haplotype.csv
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Section 13.5.
data(haplotype)
str(haplotype)
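Given the column layout described above (X, race, then Snp1 through Snp100 with NA for missing genotypes), a first step is often to isolate the SNP columns and count the missing values. The sketch below shows one way to do this; `snp_summary()` is a hypothetical helper and the toy data frame uses made-up values, not the haplotype data.

```r
# Summarize missingness in the SNP columns of a haplotype-style data frame.
# Assumes SNP columns are named Snp1, Snp2, ... and a `race` column exists.
snp_summary <- function(df) {
  snp_cols <- grep("^Snp[0-9]+$", names(df), value = TRUE)
  snps <- df[, snp_cols, drop = FALSE]
  list(
    n_snps          = length(snp_cols),
    missing_per_snp = colSums(is.na(snps)),
    missing_by_race = tapply(rowSums(is.na(snps)), df$race, sum)
  )
}

# Toy stand-in with the documented column names (values are made up)
toy <- data.frame(
  X    = 1:4,
  race = c("A", "A", "B", "B"),
  Snp1 = c(0, 1, NA, 2),
  Snp2 = c(NA, NA, 1, 0)
)
res <- snp_summary(toy)
res$n_snps            # 2
res$missing_per_snp   # Snp1: 1, Snp2: 2
res$missing_by_race   # A: 2, B: 1

# With the package data:
# res <- snp_summary(haplotype)
```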
Insurance company life table from Table 9.1. At each age, gives the number of policy holders and the number of deaths.
A data frame with rows for each age group and 3 variables:
Age of policy holders
Number of policy holders at this age
Number of deaths at this age
https://hastie.su.domains/CASI_files/DATA/insurance.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Table 9.1.
data(insurance)
str(insurance)
Gene expression measurements on 72 leukemia patients: 47 ALL (acute lymphoblastic leukemia) and 25 AML (acute myeloid leukemia). From the landmark Golub et al. (1999) Science paper. This smaller subset contains 3571 genes and is used in Section 19.1.
A data frame with 3571 rows (genes) and 72 columns (patients). Column names indicate the class label (ALL or AML) for each patient.
A larger dataset with 7128 genes is also available from the CASI website.
https://hastie.su.domains/CASI_files/DATA/leukemia_small.csv
Golub, T.R., et al. (1999). Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286, 531-537.
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Section 19.1.
data(leukemia_small)
str(leukemia_small)
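Since the class label is carried in the column names, a classifier needs those labels extracted into a factor first. The sketch below assumes each column name begins with `ALL` or `AML`, possibly followed by a suffix that R adds to make names unique (for example `ALL.1`); that naming pattern is an assumption and should be checked against `colnames(leukemia_small)`. The helper `class_from_colnames()` is hypothetical.

```r
# Extract the class label (ALL or AML) from the start of each column name.
# Names that do not start with ALL or AML become NA.
class_from_colnames <- function(nm) {
  factor(sub("^(ALL|AML).*$", "\\1", nm), levels = c("ALL", "AML"))
}

# Hypothetical column names for illustration
nm <- c("ALL", "ALL.1", "AML", "AML.1", "ALL.2")
labels <- class_from_colnames(nm)
table(labels)   # ALL: 3, AML: 2

# With the package data:
# labels <- class_from_colnames(colnames(leukemia_small))
# x <- t(as.matrix(leukemia_small))   # patients in rows, genes in columns
```

Transposing, as in the commented lines, puts patients in rows, which is the layout most modeling functions expect.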
Head and neck cancer survival data from the Northern California Oncology Group (NCOG), from Section 9.2. Patients were randomized to one of two treatment arms.
A data frame of survival times with the following variables:
Time in months until death or censoring
Death indicator: 1 = death observed, 0 = censored
Treatment arm: "A" = Chemotherapy, "B" = Chemotherapy + Radiation
Day of event/censoring
Month of event/censoring
Year of event/censoring
https://hastie.su.domains/CASI_files/DATA/ncog.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Section 9.2.
data(ncog)
str(ncog)
Data on lymph nodes removed from 844 cancer patients, from Figure 6.3.
A data frame with 844 rows and 2 variables:
Number of lymph nodes removed
Number of nodes found to be positive (malignant)
https://hastie.su.domains/CASI_files/DATA/nodes.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Figure 6.3.
data(nodes)
str(nodes)
Survival data on 1620 children with cancer, from Section 9.4 and Table 9.6.
A data frame with 1620 rows and 7 variables:
Sex: 1 = male, 2 = female
Race: 1 = white, 2 = nonwhite
Age in years
Calendar date of entry in days since July 1, 2001
Home distance from treatment center in miles
Survival time in days
Death indicator: 1 = death observed, 0 = censored
https://hastie.su.domains/CASI_files/DATA/pediatric.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Section 9.4, Table 9.6.
data(pediatric)
str(pediatric)
Z scores for 2749 New York City police officers, from Figure 15.7. A large value suggests racial bias in policing behavior.
A data frame with 2749 rows and 1 variable:
Z score measuring potential racial bias
https://hastie.su.domains/CASI_files/DATA/police.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Figure 15.7.
data(police)
str(police)
Vector of 6033 z-values comparing gene expression between prostate cancer patients and controls, as pictured in Figure 3.4. These were computed as described on page 272.
A data frame with 6033 rows and 1 variable:
Z-value for each gene comparing cancer vs control expression
https://hastie.su.domains/CASI_files/DATA/prostz.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Section 3.3, Figure 3.4.
data(prostz)
str(prostz)
Test scores for 22 students on 5 different exams, from Tables 3.1 and 10.1.
A data frame with 22 rows and 5 variables:
Mechanics exam score
Vectors exam score
Algebra exam score
Analysis exam score
Statistics exam score
https://hastie.su.domains/CASI_files/DATA/student_score.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Tables 3.1 and 10.1.
data(student_score)
str(student_score)
Measurements from 39 Type Ia supernovas, from Figure 12.1 and Table 12.1. These supernovas were close enough to Earth to observe their actual magnitudes. The goal is to predict magnitude from spectral energy measurements.
A data frame with 39 rows and 11 variables:
Actual observed magnitude of the supernova
Spectral energy in frequency band 1
Spectral energy in frequency band 2
Spectral energy in frequency band 3
Spectral energy in frequency band 4
Spectral energy in frequency band 5
Spectral energy in frequency band 6
Spectral energy in frequency band 7
Spectral energy in frequency band 8
Spectral energy in frequency band 9
Spectral energy in frequency band 10
https://hastie.su.domains/CASI_files/DATA/supernova.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Figure 12.1, Table 12.1.
data(supernova)
str(supernova)
Data on vasoconstriction (lung constriction) response, from Table 13.2.
A data frame with 39 rows and 2 variables:
Volume measurement
Logical: TRUE if constriction occurred, FALSE otherwise
https://hastie.su.domains/CASI_files/DATA/vasoconstriction.txt
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Cambridge University Press, Table 13.2.
data(vasoconstriction)
str(vasoconstriction)