R: Basic statistics


The tests covered here are primarily those dealt with in the Cambridge NST 1B Psychology undergraduate statistics course (see 2004-5 handout PDF).

  1. Using probability distributions
  2. Exploratory data analysis
  3. Correlation
  4. Basic linear regression
  5. Less common things to do with linear regression
  6. Parametric difference tests: the t test
  7. Nonparametric difference tests
  8. Chi-square test

Using probability distributions

Basic maths covered in the handout:

sum(x) # adds up a vector
log(x) # default is natural log
log(x, base=y) # any sort of log
log2(x) # short cut
log10(x) # short cut
exp(x) # e^x

R provides a large number of functions to deal with probability distributions. They come in sets. A typical set has a density function, a cumulative probability distribution function, a quantile function, and a random number generating function. For example, the dnorm/pnorm/qnorm/rnorm set deals with the normal distribution:

dnorm(x, mean=0, sd=1) # Density function.
# For this example, it is 0 at x=-infinity, reaches a peak of 0.39 at x=0, and descends symmetrically to zero at x=infinity. As for all probability density functions, the total area under the curve is 1.

pnorm(q, mean=0, sd=1) # Cumulative probability distribution function.
# Returns P(X <= q), where X is a random variable of the specified normal distribution. Default mean=0; default SD=1. For example, pnorm(0) = 0.5; pnorm(1.644854) = 0.95.

qnorm(p, mean=0, sd=1) # Quantile function.
# Returns q such that P(X <= q) = p. For example, qnorm(0) = -Inf; qnorm(0.5) = 0; qnorm(0.95) = 1.644854; qnorm(0.975) = 1.959964.

rnorm(n, mean=0, sd=1) # Random number generator.
# Generate n random numbers from the specified normal distribution.
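These four functions fit together: pnorm and qnorm are inverses of each other, and dnorm integrates to the probabilities that pnorm reports. A minimal sketch (standard R, nothing assumed beyond the functions above):

```r
# pnorm and qnorm are inverse functions
pnorm(qnorm(0.975))            # recovers 0.975
qnorm(pnorm(1.96))             # recovers 1.96

# the area under dnorm between -1.96 and 1.96 is about 0.95
integrate(dnorm, -1.96, 1.96)

# random draws: the mean of many rnorm() values approaches the specified mean
mean(rnorm(100000, mean = 5, sd = 2))  # close to 5
```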

By the way, to see a function plotted, use curve:

curve(dnorm(x), from=-5, to=5)

Just a few other distribution functions (they work in the same general way):

[d|p|q|r]norm() # Normal distribution
[d|p|q|r]t() # Student t distribution
[d|p|q|r]f() # F distribution
[d|p|q|r]chisq() # Chi-square distribution
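The q functions of these sets are handy for looking up critical values that would otherwise come from printed tables, and the p functions give p values from test statistics. For example (two-tailed alpha = .05):

```r
qt(0.975, df = 20)            # critical t for 20 df, two-tailed alpha = .05 (about 2.086)
qf(0.95, df1 = 3, df2 = 16)   # critical F(3,16) at alpha = .05
qchisq(0.95, df = 1)          # critical chi-square, 1 df (about 3.841)

# and the reverse: p values from test statistics
2 * pt(-abs(2.5), df = 20)               # two-tailed p for t = 2.5 on 20 df
pchisq(4.2, df = 1, lower.tail = FALSE)  # p for chi-square = 4.2 on 1 df
```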

Exploratory data analysis

Basic descriptive statistics:

fivenum() # Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum)
summary() # Summarizes model fits, and simple data (for which it gives minimum, first quartile, median, mean, third quartile, maximum).
quantile() # calculates any quantile (not just the default of 0%, 25%, 50%, 75%, 100%)
var() # sample variance
sd() # sample standard deviation

# Now to summarize things by factor(s):
install.packages("Hmisc") # if not yet done
library(Hmisc) # get the Hmisc library, for its "summary.formula" function
summary(depvar ~ factor(s), data=mydataframe) # because you've passed a formula to summary, it calls summary.formula
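If you prefer to stay in base R rather than install Hmisc, tapply and aggregate give similar by-group summaries. A sketch with made-up example data (the variable and group names here are invented):

```r
# made-up example data
mydataframe <- data.frame(
  depvar = c(5, 7, 6, 10, 12, 11),
  group  = factor(c("a", "a", "a", "b", "b", "b"))
)

tapply(mydataframe$depvar, mydataframe$group, mean)        # mean per group
aggregate(depvar ~ group, data = mydataframe, FUN = mean)  # same, data-frame output
```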

Counts, boxplots, histograms, and so forth:

table(x) # shows counts of each value in x (producing a unidimensional list of frequencies with the values as dimnames)
table(x)/length(x) # shows relative frequencies
barplot(table(x)) # by virtue of the way table() produces its results, this produces a frequency histogram

hist(x) # a more general way of doing a histogram (also deals with continuous data).
  # Use "prob=TRUE" to plot relative frequencies (so the area under the graph is 1).
  # The "breaks" option can be used for manual control of the number of bins.

stem(x) # Stem-and-leaf plot
boxplot(x) # Box-and-whisker plot

plot(ecdf(x)) # Plot the empirical cumulative distribution function for x
qqnorm(x) # Creates a normal Q-Q plot (plotting the quantiles of x against the quantiles of a normal distribution).
qqline(x) # Adds a reference line for the Q-Q plot.

Scatterplots:

plot(x, y) # Basic scatterplot

pairs(d) # Creates scatterplots for every pair of variables in a dataframe d

Correlation

Covariance:

cov(x, y)
# Also possible to use cov(x), where x is a matrix/dataframe, to produce all possible covariances between variables (columns).

Pearson product-moment correlation coefficient, r:

cor(x, y)
# Also possible to use cor(x), where x is a matrix/dataframe, to produce all possible correlations between variables (columns).
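A quick sketch of the matrix form on simulated data (the variable names a/b/c are made up; a and b are constructed to be correlated, c is independent noise):

```r
set.seed(1)                     # for reproducibility
x <- rnorm(100)
d <- data.frame(a = x, b = x + rnorm(100), c = rnorm(100))

cor(d)                          # full 3x3 correlation matrix
round(cor(d), 2)                # rounded for readability
cor(d)["a", "b"]                # pick out a single correlation
```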

Correlation with t test of significance, plus confidence intervals:

cor.test(x, y)

To perform significance testing on a matrix of correlations, see http://tolstoy.newcastle.edu.au/R/help/01c/2272.html.

Spearman's rho (and significance test), a nonparametric test:

cor.test(x, y, method="spearman")

Basic linear regression

Proceed as follows. First, let's perform the regression. Use lm (for linear model):

fit <- lm(y ~ x) # y is predicted by x

Now, let's see what that did:

names(fit) # the components of the fitted model object
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

class(fit) # the class of the object
 [1] "lm"

Show the intercept and slope:

fit # this is the object we just stored the results in

Summary, with intercept/slope and significance testing by t test (also gives r2 and adjusted r2):

summary(fit)

Significance testing by analysis of variance (regression ANOVA):

anova(fit)

Confidence intervals (default: 95% confidence intervals) on the intercept/slope:

confint(fit)

Scatterplot with regression line:

plot(x, y)
abline(fit) # superimpose the fitted regression line

Residuals plot:

plot(x, fit$residuals)
abline(0,0) # add a reference line at 0
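Putting the basic regression steps together on simulated data (all values here are made up; the data are generated with a true intercept of 3 and slope of 2, so the estimates should land near those):

```r
set.seed(42)                       # for reproducibility
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 2) # true intercept 3, slope 2, plus noise

fit <- lm(y ~ x)
coef(fit)     # estimated intercept and slope, near 3 and 2
summary(fit)  # coefficients with t tests, r-squared
confint(fit)  # 95% confidence intervals on intercept and slope
```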

Less common things to do with linear regression

Following on from the above...

Confidence intervals on predictions (prediction intervals) for new x values (example values of 1 and 5 used here):

predict.lm(fit, newdata=data.frame(x=c(1,5)), interval="prediction")
# there's also interval="confidence", which gives a (much tighter) confidence interval for the mean of y at each new x, rather than a prediction interval for a single new observation.
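To see the difference between the two interval types, here is a sketch on simulated data (all values made up): the prediction interval, which covers a single new observation, is always wider than the confidence interval, which covers the mean response at that x.

```r
set.seed(1)                        # for reproducibility
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 2) # made-up data
fit <- lm(y ~ x)

new <- data.frame(x = c(1, 5))
predict(fit, newdata = new, interval = "prediction") # for a single new observation (wide)
predict(fit, newdata = new, interval = "confidence") # for the mean response at that x (narrow)
```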

Partial correlations:

# Quite a few ways to do partial correlations. A simple one is to put them all as variables in a data matrix,
# and use partial.cor to get partial correlations between every pair, controlling for all other variables.
library(Rcmdr) # partial.cor is part of the Rcmdr library
library(car) # example data
data(DavisThin) # example data
partial.cor(DavisThin) # partial correlations between every pair of variables, controlling for all the others
Point-biserial correlation and phi: see handout, and use cor() as normal!

Parametric difference tests: the t test

Basic t-tests, equal variances not assumed, and with the Welch df correction (the R default):

# Independent two-group t-test
t.test(y ~ x) # where y is numeric and x is a binary factor

# Independent two-group t-test
t.test(y1, y2) # where y1 and y2 are numeric

# Paired t-test
t.test(y1, y2, paired=TRUE) # where y1 and y2 are numeric

# One sample t-test
t.test(y, mu=3) # null hypothesis: mean (mu) = 3

# t-test, within a data frame, optionally specifying a subset of the data
t.test(y ~ x, data=mydataframe, subset=sex=="M") # where y is numeric, x is a binary factor, and you want to look at cases where "sex" is "M"

Confidence intervals are given by default.
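A quick worked example with made-up data, showing how to pull the individual results out of the returned object:

```r
y1 <- c(4.1, 5.2, 6.3, 5.8, 4.9, 5.5)  # made-up group 1
y2 <- c(6.0, 7.1, 6.8, 7.4, 6.2, 7.0)  # made-up group 2

result <- t.test(y1, y2)  # Welch test by default
result$statistic          # the t value
result$p.value            # the p value
result$conf.int           # 95% CI for the difference in means
```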

To specify equal variances and a pooled variance estimate:

# T-test with equal variances assumed (one example):
t.test(y1, y2, var.equal=TRUE)

To specify a one-tailed t-test, use this sort of modification:

# One-tailed t test: is y1 less than y2 (one example)?
t.test(y1, y2, alternative="less")
# ... or the other way round (don't do both!)
t.test(y1, y2, alternative="greater")

F test to compare two variances:

# F test to compare two variances.
var.test(y1, y2)

Nonparametric difference tests

These tests work on the ranks of the data rather than the raw values, so extreme scores have limited influence:

y <- c(1,2,3,4,100)
rank(y) # gives 1 2 3 4 5: the outlier 100 simply gets the top rank
Mann-Whitney U test (or, Wilcoxon rank sum test - logically equivalent; see handout PDF above) for two independent samples:

# Mann-Whitney U test (or, Wilcoxon rank sum test) for two independent groups. Two ways of doing the same thing:

y1 <- c(1.68, 3.83, 3.11, 2.76, 1.70, 2.79, 3.05, 2.66, 1.40, 2.775) # example data
y2 <- c(2.94, 3.38, 4.90, 2.81, 2.80, 3.21, 3.08, 2.95) # example data
wilcox.test(y1, y2, paired=FALSE) # the "paired=FALSE" is the default, so can be omitted

# Can also use alternative="two.sided"/"greater"/"less" to specify null hypothesis.
# The output gives W, the Wilcoxon rank sum statistic (use [d|p|q|r]wilcox to use distribution functions by hand, if you want).
# The way R calculates W is the same method as given for the Mann-Whitney U in the PDF handout linked to above.

y3 <- c(1.68, 3.83, 3.11, 2.76, 1.70, 2.79, 3.05, 2.66, 1.40, 2.775, 2.94, 3.38, 4.90, 2.81, 2.80, 3.21, 3.08, 2.95) # example data
a <- factor(c(rep(1,10),rep(2,8)))
wilcox.test(y3 ~ a)
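You can check the W statistic by hand, using the y1/y2 example data above: rank all the observations jointly, sum the ranks of the first sample, and subtract n1(n1+1)/2.

```r
y1 <- c(1.68, 3.83, 3.11, 2.76, 1.70, 2.79, 3.05, 2.66, 1.40, 2.775) # example data
y2 <- c(2.94, 3.38, 4.90, 2.81, 2.80, 3.21, 3.08, 2.95)              # example data

r <- rank(c(y1, y2))  # joint ranks of all 18 observations
W <- sum(r[1:length(y1)]) - length(y1) * (length(y1) + 1) / 2
W                     # matches wilcox.test(y1, y2)$statistic
```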

Wilcoxon matched-pairs signed-rank test (for two related samples):

y1 <- c(130, 148, 170, 125, 170, 130, 130, 145, 119, 160) # example data
y2 <- c(120, 148, 163, 120, 135, 143, 136, 144, 119, 120) # example data
wilcox.test(y1, y2, paired=TRUE)

# Options to specify one- or two-tailed tests are as above.
# The output gives V, the Wilcoxon signed rank statistic (use [d|p|q|r]signrank to use distribution functions by hand, if you want).

Chi-square test

Specify the contingency table in a matrix. We'll call it C. Note that R fills matrices down columns, then across rows, by default (though you can use "byrow=TRUE" to alter that).

C<-matrix(c(30,60,70,30), nrow=2)
C # display the matrix
     [,1] [,2]
[1,]   30   70
[2,]   60   30

Then run the test, as shown below. If the argument is a vector, a goodness-of-fit test is performed (with the null hypothesis that all categories are equally likely, unless the p argument specifies the expected probabilities explicitly). If the argument is a matrix with at least two rows and columns, a contingency test is performed; the expected values are calculated as (row total x column total)/grand total, as is typical in such tests. The correct argument (TRUE by default) specifies whether Yates' continuity correction is applied (to 2x2 tables).
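For the contingency case, the expected values can be computed by hand from the marginal totals, using the matrix C defined above:

```r
C <- matrix(c(30,60,70,30), nrow=2)                # the contingency table from above
expected <- outer(rowSums(C), colSums(C)) / sum(C) # (row total x column total)/grand total
expected
# should match chisq.test(C)$expected
```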

# Chi-square test
chisq.test(C)

To see the observed and expected values, store the output and interrogate it:

result <- chisq.test(C)
result$observed # the observed counts
result$expected # the expected counts under the null hypothesis

Note that if some expected values are <5, it is preferable to use Fisher's exact test:

# Fisher's Exact Test for count data
fisher.test(C)
Is my die true? Another chi-square example:

counts <- c(41, 50, 67, 47, 25, 50) # rolls of a die: number of ones, twos, threes...
probs <- rep(1/6, 6)
chisq.test(counts, p = probs) # actually, specifying these particular probabilities is optional, since the default is to test against equiprobable categories anyway
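To see what chisq.test is doing in the die example, the statistic and p value can be computed by hand (same counts as above):

```r
counts <- c(41, 50, 67, 47, 25, 50)                # rolls of a die, as above
expected <- sum(counts) * rep(1/6, 6)              # about 46.67 in each category
X2 <- sum((counts - expected)^2 / expected)        # the chi-square statistic
p <- pchisq(X2, df = length(counts) - 1, lower.tail = FALSE)
X2; p  # matches chisq.test(counts)
```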