R: Basics of R objects; entering and manipulating data
http://www.psychol.cam.ac.uk/statistics/R/enteringdata.html

Index

  1. Basics of R objects
  2. Accessing and manipulating R objects
  3. Entering and editing data by hand
  4. Reshaping data frames

Basics of R objects

Very basic things:

x <- 1 # enters a single number into x (<- and = are alternative/synonymous assignment operators)

v <- c(1,2,3,4,5) # c for combine/concatenate; this creates a (one-dimensional) vector

temp <- v>3 # will make temp equal to the logical vector c(FALSE, FALSE, FALSE, TRUE, TRUE) by performing comparisons on each element of v

z <- c(1:3,NA) # the fourth element of z is the "missing" value, NA (not available)
is.na(z) # returns c(FALSE, FALSE, FALSE, TRUE)

m <- matrix( c(1,2,3,4,5,6), nrow=2, ncol=3, byrow=FALSE)
# byrow=FALSE is the default, meaning that data go into the matrix filling up one column top to bottom before starting the next.
# This makes m the following 2x3 matrix:
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Using sequences and repetition:

1:9 # same as c(1,2,3,4,5,6,7,8,9)
1.5:10 # same as c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5)

seq(1,5) # same as 1:5
seq(1,5,by=0.5) # gives 1.0, 1.5, 2.0, ... 5.0

rep(5,10) # repeats the value 5 ten times
rep(c("A","B","C"),2) # same as c("A", "B", "C", "A", "B", "C")
matrix(rep(0,16),nrow=4) # gives a 4x4 matrix of zeroes
matrix() # gives a blank matrix (1x1 containing the missing value "NA")
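
rep() also takes each= and times= arguments, which the fictional dataset further down this page relies on. A minimal sketch of how they differ (the variable names are illustrative only):

```r
# each= repeats every element in place; times= repeats the whole vector
by_each  <- rep(c("A", "B"), each = 2)             # "A" "A" "B" "B"
by_times <- rep(c("A", "B"), times = 2)            # "A" "B" "A" "B"
both     <- rep(c("A", "B"), each = 2, times = 2)  # "A" "A" "B" "B" "A" "A" "B" "B"

# seq() can be told how many values to produce instead of the step size
quarters <- seq(0, 1, length.out = 5)              # 0.00 0.25 0.50 0.75 1.00
```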

# To concatenate strings, use paste():
paste("A", "B", "C") # This gives the single item "A B C" (the default separator is a space; use sep="" to get "ABC") - whereas c("A","B","C") gives a vector of three items.

# You can also get fancy:
labels <- paste(c("X","Y"), 1:10, sep="") # same as labels <- c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")
# Note that the short vector of two items is recycled until it's as long as the longest vector (of 10 items).
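
To make the separator and recycling behaviour of paste() concrete, here is a small sketch. The variable names are made up for illustration; sep= joins the arguments element by element, and collapse= joins the elements of the result into one string:

```r
spaced   <- paste("A", "B", "C")                    # "A B C" (default sep is a space)
unspaced <- paste("A", "B", "C", sep = "")          # "ABC"
short    <- paste0(c("X", "Y"), 1:4)                # paste0() is paste() with sep = ""
                                                    # "X1" "Y2" "X3" "Y4" (c("X","Y") recycled)
joined   <- paste(c("A", "B", "C"), collapse = "-") # "A-B-C" (one string from a vector)
```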

Addressing vectors and matrices:

v[4] # 4th element of vector v (note: first element is element 1)
v[c(2,4)] # second and fourth element of v
m[,3] # 3rd column of matrix m
m[2,] # 2nd row of matrix m
m[1:2,2:3] # rows 1-2 of columns 2-3
v[!is.na(v)] # all elements of v that are not NA, in the same order as before
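
A short, self-contained sketch of these indexing forms, using a fresh vector and matrix (redefined here so the example runs on its own):

```r
v <- c(10, NA, 30, 40, NA)
m <- matrix(1:6, nrow = 2)            # 2x3 matrix, filled column by column

not_missing <- v[!is.na(v)]           # 10 30 40
big         <- v[!is.na(v) & v > 20]  # 30 40 (logical conditions can be combined)
without_1st <- v[-1]                  # negative index drops an element: NA 30 40 NA

cell   <- m[2, 3]                     # single element: 6
col2   <- m[, 2]                      # 2nd column as a plain vector: 3 4
row1_m <- m[1, , drop = FALSE]        # drop=FALSE keeps the result as a 1x3 matrix
```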

Factors:

grouplist <- c("sham", "lesion", "sham", "sham", "lesion", "sham", "lesion")
groupfactor <- factor(grouplist) # makes a factor (the same length as grouplist, but marked as a factor)
levels(groupfactor) # will produce "lesion", "sham" (levels are sorted alphabetically by default)

tapply(var, fac, func) # applies function "func" to all elements of "var", grouped by "fac"
# example:
# a <- c(1,2,3,4,5,6,7,8,9)
# f <- factor(c("a","a","a","b","b","b","c","c","c"))
# tapply(a, f, mean)
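
Run as live code, the commented example above looks like this (group_means is just an illustrative name):

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
f <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c"))

group_means <- tapply(a, f, mean)  # one mean per level of f
group_means
# a b c
# 2 5 8
```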

# Factors can be given names (labels) as well as their original numeric values:
# ... Suppose variable v1 is coded 1, 2 or 3, and we want to attach value labels 1=red, 2=blue, 3=green:
mydata$v1 <- factor(mydata$v1, levels = c(1,2,3), labels = c("red", "blue", "green"))
# ... Or, to make the factor ordered (i.e. R knows there's an order to the levels); this example takes a variable y coded 1, 3 or 5:
mydata$v1 <- ordered(mydata$y, levels = c(1,3,5), labels = c("Low", "Medium", "High"))
# Use factors for nominal data, and ordered factors for ordinal data.
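
The practical difference between the two shows up in comparisons. A minimal sketch, using a made-up vector of codes rather than a column of mydata:

```r
codes <- c(1, 3, 5, 3, 1)  # hypothetical raw codes

nominal <- factor(codes,  levels = c(1, 3, 5), labels = c("Low", "Medium", "High"))
ordinal <- ordered(codes, levels = c(1, 3, 5), labels = c("Low", "Medium", "High"))

is.ordered(nominal)                      # FALSE
is.ordered(ordinal)                      # TRUE
at_least_medium <- ordinal >= "Medium"   # FALSE TRUE TRUE TRUE FALSE
level_codes     <- as.integer(ordinal)   # 1 2 3 2 1 (positions in levels(), not the raw codes)
```

The same >= comparison on the unordered factor is not meaningful; R returns NA with a warning.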

Lists:

mylist <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
# can now use mylist[[1]] to get "Fred"
# ... or mylist$name to get "Fred"
# ... or mylist$child.ages[2] to get 7
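
One point that often trips people up is the difference between single and double brackets on a list. A brief sketch with the same mylist:

```r
mylist <- list(name = "Fred", wife = "Mary", no.children = 3, child.ages = c(4, 7, 9))

elem    <- mylist[["name"]]           # double brackets extract the element itself: "Fred"
sublist <- mylist["name"]             # single brackets return a list of length 1
second  <- mylist[["child.ages"]][2]  # 7, same as mylist$child.ages[2]

class(elem)      # "character"
class(sublist)   # "list"
names(mylist)    # "name" "wife" "no.children" "child.ages"
```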

Other:

attributes(object) # views the attributes of "object"
attr(object, name) # gets/sets a specific attribute (by name) of object "object"
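
As a small illustration of attr(), the sketch below attaches an arbitrary attribute and then sets the special "dim" attribute. The attribute name "units" is invented for the example and has no built-in meaning:

```r
x <- 1:6
attr(x, "units") <- "mg"        # store a custom (made-up) attribute
x_units <- attr(x, "units")     # read it back: "mg"

attr(x, "dim") <- c(2, 3)       # setting "dim" has the same effect as dim(x) <- c(2, 3)
x_is_matrix <- is.matrix(x)     # TRUE: x is now treated as a 2x3 matrix
x_dim <- dim(x)                 # 2 3
```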

Data frames:

mydata <- data.frame(x = c(20,35,45,55,70), n = rep(50,5), y = c(6,17,26,37,44))

# mydata is a data frame. It is like a matrix with three column variables (mydata$x, mydata$n, mydata$y).
# A data frame is like a matrix in which the columns may be of different types (e.g. numerical variable, factor, text).
# Lots of R tests use data frames.
# Let's look at it:

mydata
   x  n  y
1 20 50  6
2 35 50 17
3 45 50 26
4 55 50 37
5 70 50 44
attributes(mydata)
$names
[1] "x" "n" "y"

$row.names
[1] 1 2 3 4 5

$class
[1] "data.frame"

We can make the names prettier:

attr(mydata, "row.names") <- c("pointone","pointtwo","pointthree","pointfour","pointfive")
mydata
            x  n  y
pointone   20 50  6
pointtwo   35 50 17
pointthree 45 50 26
pointfour  55 50 37
pointfive  70 50 44

A few more handy functions:

names(mydata) # Lists the names of variables within mydata (also visible with attributes(), as above).
str(mydata) # Shows the structure ("5 obs[ervations] of 3 variables... x is ...  f is a factor with 2 levels..." etc.)
levels(factor) # Shows the levels of a factor
dim(object) # Shows an object's dimensions
class(object) # Shows an object's class (e.g. numeric, matrix, data.frame)

head(mydata, n=3) # Shows the first 3 rows
tail(mydata, n=3) # Shows the last 3 rows

subset(mydata, x>=45) # Pick a subset
subset(mydata, x!=45, c(x,y)) # ... or another subset... there are lots of things you can do with this command; see ?subset.

It's easy to sort data frames and to create new variables based on existing ones. See Quick-R / Data Management.
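
For instance, a minimal base-R sketch of both operations on the mydata frame from above (the new column names prop and pct are invented for the example):

```r
mydata <- data.frame(x = c(20, 35, 45, 55, 70), n = rep(50, 5), y = c(6, 17, 26, 37, 44))

# New variable from existing ones
mydata$prop <- mydata$y / mydata$n

# transform() is an alternative that lets you refer to columns by bare name
mydata <- transform(mydata, pct = 100 * prop)

# Sorting: order() returns the row order, which is then used as a row index
sorted_desc <- mydata[order(-mydata$y), ]  # descending by y
sorted_desc$x                              # 70 55 45 35 20
```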

Accessing and manipulating R objects

Making a data set visible on the main search path:

attach(my.dataset) # we now don't need to use my.dataset$X, my.dataset$Y; we can just use X and Y directly

# Note that to change variables in the dataset, you still need to assign to my.dataset$var
# (otherwise a new variable called var is created that simply masks the one in the dataset).

# By the way, get used to the R convention: my.dataset is just a variable name; the dot doesn't mean anything special.
search() # shows the current search path (will now include my.dataset)
detach(my.dataset) # when we've finished with it

# Another way, which has no residual effects:
with(my.dataset, {
    # do stuff...
})
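
A concrete sketch of with(), plus its companion within() for modifying a copy of the data. Here my.dataset is a small made-up data frame:

```r
my.dataset <- data.frame(X = c(1, 2, 3), Y = c(2, 4, 6))  # hypothetical data

mean_x <- with(my.dataset, mean(X))     # 2: X is looked up inside my.dataset
sum_xy <- with(my.dataset, sum(X * Y))  # 1*2 + 2*4 + 3*6 = 28

# within() evaluates the block and returns a modified copy of the data frame
my.dataset <- within(my.dataset, {
    Z <- X + Y
})
my.dataset$Z                            # 3 6 9
```

Unlike attach(), neither function leaves anything on the search path afterwards.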

Other important object manipulation functions:

ls() # list all objects (if you know UNIX, this will be familiar)
rm(x) # removes object "x" (if you know UNIX, this will be familiar)

Entering and editing data by hand

Typing stuff in; note also that filenames and URLs are often interchangeable:

x <- scan() # type in numbers, separated by spaces or newlines; hit Enter twice to finish
x <- scan(filename) # do the same but reading from a file on disk
x <- scan("http://www...") # the same, but from a URL (live)

Editing a variable, matrix, or data frame:

y <- edit(x)
fix(x) # equivalent to x <- edit(x)

In the R Commander, you can click the Data set button to select a data set, and then click the Edit data set button.

For more advanced data manipulation in R Commander, explore the Data menu, particularly the Data / Active data set and Data / Manage variables in active data set menus.

Reshaping data frames

Often, you need to transform data between "wide" format (e.g. one row per subject, with multiple observation columns per subject) and "long" format (one observation per row). R uses "long" format for most analyses. There are several methods; reshape() is powerful.

See ?reshape for full details. But let's glance at long-to-wide transformation:

Indometh # one of the built-in R datasets. It's in "long" format.
   Subject time conc
1        1 0.25 1.50
2        1 0.50 0.94
3        1 0.75 0.78
4        1 1.00 0.48
5        1 1.25 0.37
6        1 2.00 0.19
7        1 3.00 0.12
8        1 4.00 0.11
9        1 5.00 0.08
10       1 6.00 0.07
11       1 8.00 0.05
12       2 0.25 2.03
13       2 0.50 1.63
14       2 0.75 0.71
...
# Let's reshape it as follows. Key things:
# We start with one observation per row. We want to group them together by some variable that identifies an individual (group of observations).
# 1. Keep SUBJECT as the identifying variable, one per row (idvar)
# 2. Columns are labelled with TIME (timevar)
# 3. VALUES of CONC are spread "wide" (v.names).
#    If v.names are not specified, all variables apart from idvar and timevar are assumed to vary, and are spread wide.
# Any gaps will be filled by "NA" values.

wide <- reshape(Indometh, v.names="conc", idvar="Subject", timevar="time", direction="wide")
wide
         Subject conc.0.25 conc.0.5 conc.0.75 conc.1 conc.1.25 conc.2 conc.3 conc.4 conc.5 conc.6 conc.8
      1        1      1.50     0.94      0.78   0.48      0.37   0.19   0.12   0.11   0.08   0.07   0.05
      12       2      2.03     1.63      0.71   0.70      0.64   0.36   0.32   0.20   0.25   0.12   0.08
      23       3      2.72     1.49      1.16   0.80      0.80   0.39   0.22   0.12   0.11   0.08   0.08
      34       4      1.85     1.39      1.02   0.89      0.59   0.40   0.16   0.11   0.10   0.07   0.07
      45       5      2.05     1.04      0.81   0.39      0.30   0.23   0.13   0.11   0.08   0.10   0.06
      56       6      2.31     1.44      1.03   0.84      0.64   0.42   0.24   0.17   0.13   0.10   0.09
long <- reshape(wide, direction="long") # reverses the effect completely (by using information stored within wide about the original reshaping)
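
A quick way to sanity-check a reshape is to compare dimensions before and after. A sketch using the same built-in Indometh data (6 subjects, 11 time points each); the names wide_chk and long_chk are illustrative:

```r
wide_chk <- reshape(Indometh, v.names = "conc", idvar = "Subject",
                    timevar = "time", direction = "wide")
long_chk <- reshape(wide_chk, direction = "long")

nrow(Indometh)    # 66 rows in long format (6 subjects x 11 times)
dim(wide_chk)     # 6 rows, 12 columns (Subject + 11 conc.* columns)
nrow(long_chk)    # 66 again after the round trip
anyNA(wide_chk)   # FALSE here, because every subject has every time point
```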

And an example of a more complex long-to-wide transformation: creating a fictional data frame with one between-subject factor (A) and two within-subject factors (U, V), in long format, and then reshaping it whilst controlling the resulting column names carefully:

# First, make up a fictional dataset:
s = 10; levels_U = 3; levels_V = 2; levels_A = 2;
mean_U1V1A1 = 5; mean_U2V1A1 = 6; mean_U3V1A1 = 5.5;
mean_U1V2A1 = 7; mean_U2V2A1 = 8; mean_U3V2A1 = 7.5;
mean_U1V1A2 = 5; mean_U2V1A2 = 6; mean_U3V1A2 = 10.5;
mean_U1V2A2 = 7; mean_U2V2A2 = 8; mean_U3V2A2 = 10.5;
noise_sd = 1;
data9 = data.frame( S = paste("S", rep(1:(s*levels_A), each=levels_U*levels_V, times=1), sep=""),
                    U = rep(paste("U", 1:levels_U, sep=""), each=1, times=s*levels_V*levels_A),
                    V = rep(paste("V", 1:levels_V, sep=""), each=levels_U, times=levels_A*s),
                    A = rep(paste("A", 1:levels_A, sep=""), each=levels_U*levels_V*s, times=1),
                    depvar = c( rep( c(mean_U1V1A1, mean_U2V1A1, mean_U3V1A1,
                                       mean_U1V2A1, mean_U2V2A1, mean_U3V2A1), each=1, times=s),
                                rep( c(mean_U1V1A2, mean_U2V1A2, mean_U3V1A2,
                                       mean_U1V2A2, mean_U2V2A2, mean_U3V2A2), each=1, times=s) )
                             + rnorm(s*levels_U*levels_V*levels_A, mean=0, sd=noise_sd)
)

head(data9)
   S  U  V  A   depvar
1 S1 U1 V1 A1 4.262872
2 S1 U2 V1 A1 6.201466
3 S1 U3 V1 A1 6.593957
4 S1 U1 V2 A1 6.482112
5 S1 U2 V2 A1 6.894665
6 S1 U3 V2 A1 7.686345
# Handy function to strip prefixes from the column names (of cosmetic value only!) since the reshape process will add a prefix:
colnames_removing_prefix <- function(df, prefix) {
	names <- colnames(df)
	indices <- (substr(names,1,nchar(prefix))==prefix)
	names[indices] <- substr(names[indices], nchar(prefix)+1, nchar(names[indices]))
	return(names)
}

# Now, reshape our data frame:
data9$compositewsvar = factor(paste(data9$U, data9$V, sep=""))
data9wide <- reshape(data9, v.names="depvar", idvar="S", timevar="compositewsvar", drop=c("U","V"), direction="wide") # reshape to wide format
colnames(data9wide) <- colnames_removing_prefix(data9wide, "depvar.")

head(data9wide)
    S  A     U1V1     U2V1     U3V1     U1V2     U2V2     U3V2
1  S1 A1 4.262872 6.201466 6.593957 6.482112 6.894665 7.686345
7  S2 A1 4.429626 6.334341 5.483475 6.945418 8.714208 7.750707
13 S3 A1 5.634549 5.350344 6.694130 5.411150 8.461070 8.738355
19 S4 A1 6.573824 6.171209 3.898931 6.334141 7.658620 7.547592
25 S5 A1 4.377890 6.585668 5.870754 6.541264 8.790811 5.967742
31 S6 A1 4.578090 4.735252 5.940648 5.163605 6.803640 8.000925

And wide-to-long transformation:

dfwide <- data.frame(id=1:4, age=c(40,50,60,50), dose1=c(1,2,1,2), dose2=c(2,1,2,1), dose4=c(3,3,3,3))
dfwide
  id age dose1 dose2 dose4
1  1  40     1     2     3
2  2  50     2     1     3
3  3  60     1     2     3
4  4  50     2     1     3
# Key things:
# We start with one individual per row. Some variables represent observations that we want to re-group into a variable
# with the observation, and other variable(s) describing what sort of observation is on that row.
# 1. We say which columns need to be regrouped (varying; in this case columns 3:5).
# 2. By default, the "label" column that's created is called time.
# 3. By default, the program assumes that the current column labels take the form "x.1", "x.2", "y.1", "y.2".
#    In this case (separator as "."), columns labelled "x" and "y" will be created, with "time" values of 1, 2, etc.
#    In this example, we use sep="" instead to show that the number follows the alphanumeric part directly.

long <- reshape(dfwide, direction="long", varying=3:5, sep="")
long
    id age time dose
1.1  1  40    1    1
2.1  2  50    1    2
3.1  3  60    1    1
4.1  4  50    1    2
1.2  1  40    2    2
2.2  2  50    2    1
3.2  3  60    2    2
4.2  4  50    2    1
1.4  1  40    4    3
2.4  2  50    4    3
3.4  3  60    4    3
4.4  4  50    4    3

See ?reshape for further ways to control the process.
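
As one example of such control, the sketch below names everything explicitly instead of relying on the column-name pattern. The label column name "week" is made up for illustration; the arguments varying, v.names, timevar, times and idvar are all documented in ?reshape:

```r
dfwide <- data.frame(id = 1:4, age = c(40, 50, 60, 50),
                     dose1 = c(1, 2, 1, 2), dose2 = c(2, 1, 2, 1), dose4 = c(3, 3, 3, 3))

long2 <- reshape(dfwide, direction = "long",
                 varying = c("dose1", "dose2", "dose4"),  # columns to stack
                 v.names = "dose",                        # name of the stacked column
                 timevar = "week",                        # name of the label column
                 times   = c(1, 2, 4),                    # label values, one per stacked column
                 idvar   = "id")                          # existing column identifying each row

names(long2)   # "id" "age" "week" "dose"
nrow(long2)    # 12 (4 individuals x 3 dose columns)
```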
