Advanced Data Analysis from an Elementary Point of View

by Cosma Rohilla Shalizi

This is a draft textbook on data analysis methods, intended for a one-semester course for advanced undergraduate students who have already taken classes in probability, mathematical statistics, and linear regression. It began as the lecture notes for 36-402 at Carnegie Mellon University.

By making this draft generally available, I am not promising to provide any assistance or even clarification whatsoever. Comments are, however, welcome.

The book is under contract to Cambridge University Press; it should be turned over to the press before the end of 2015. A copy of the next-to-final version will remain freely accessible here permanently.

Complete draft in PDF

Table of contents:

I. Regression and Its Generalizations
Regression Basics
The Truth about Linear Regression
Model Evaluation
Smoothing in Regression
The Bootstrap
Weighting and Variance
Additive Models
Testing Regression Specifications
Logistic Regression
Generalized Linear Models and Generalized Additive Models
Classification and Regression Trees
II. Distributions and Latent Structure
Density Estimation
Relative Distributions and Smooth Tests of Goodness-of-Fit
Principal Components Analysis
Factor Models
Nonlinear Dimensionality Reduction
Mixture Models
Graphical Models
III. Dependent Data
Time Series
Spatial and Network Data
Simulation-Based Inference
IV. Causal Inference
Graphical Causal Models
Identifying Causal Effects
Causal Inference from Experiments
Estimating Causal Effects
Discovering Causal Structure
Data-Analysis Problem Sets
Reminders from Linear Algebra
Big O and Little o Notation
Taylor Expansions
Multivariate Distributions
Algebra with Expectations and Variances
Propagation of Error, and Standard Errors for Derived Quantities
chi-squared and the Likelihood Ratio Test
Proof of the Gauss-Markov Theorem
Rudimentary Graph Theory
Information Theory
Hypothesis Testing
Writing R Functions
Random Variable Generation

Planned changes:

Unified treatment of information-theoretic topics (relative entropy / Kullback-Leibler divergence, entropy, mutual information and independence, hypothesis-testing interpretations) in an appendix, with references from chapters on density estimation, on EM, and on independence testing
More detailed treatment of calibration and calibration-checking (part II)
Missing data and imputation (part II)
Move d-separation material from the “causal models” chapter to the graphical models chapter, since it has no specifically causal content (parts II and IV)?
Expand treatment of partial identification for causal inference, including partial identification of effects by looking at all data-compatible DAGs (part IV)
Figure out how to cut at least 50 pages
Make sure notation is consistent throughout: insist that vectors are always matrices, or use more geometric notation?
Move simulation to an appendix
Move variance/weights chapter to right before logistic regression
Move some appendices online (i.e., after references)?
(Text last updated 30 March 2016; this page last updated 6 November 2015)

