The function tapply() and ragged arrays

To continue the previous example, suppose we have the incomes of the same tax accountants in another vector (in suitably large units of money):

> incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56, 
               61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
               59, 46, 58, 43)

To calculate the sample mean income for each state we can now use the special function tapply():

> incmeans <- tapply(incomes, statef, mean)

giving a vector of means with the components labelled by the levels:

> incmeans
   act    nsw     nt    qld     sa    tas    vic     wa
  44.5 57.333   55.5   53.6     55   60.5     56  52.25

The function tapply() applies a function, here mean(), to each group of components of its first argument, here incomes, where the groups are defined by the levels of its second argument, here statef, as if each group were a separate vector structure. The result is a structure of the same length as the levels attribute of the factor, containing one result per level. The reader should consult the help document for more details.
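The grouping described above can be pictured as two explicit steps: split the vector by the factor, then apply the function to each piece. The sketch below is only an illustration of that idea, not a description of how tapply() is implemented. The incomes are those given in the text; the state vector is an assumption about the previous section's data (it does reproduce the means shown above), and the names groups and by_hand are made up for this example.

```r
# Assumed data: 'state' is taken to match the previous section's example;
# 'incomes' is copied from the text.
state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
           "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
           "sa",  "act", "nsw", "vic", "vic", "act")
statef <- factor(state)
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)

# Step 1: split the incomes into one vector per level of statef.
groups <- split(incomes, statef)

# Step 2: apply mean() to each group and collect the results.
by_hand <- sapply(groups, mean)

# tapply() performs both steps in a single call.
incmeans <- tapply(incomes, statef, mean)

# The two approaches agree (compared as plain numeric vectors).
all.equal(as.vector(incmeans), as.vector(by_hand))
```

Looking at groups directly (for example groups$act) is a convenient way to see which incomes fall under each state.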

Suppose further we needed to calculate the standard errors of the state income means. To do this we need to write a function to calculate the standard error for any given vector. We discuss functions more fully later in these notes, but since there is a built-in function var() to calculate the sample variance, such a function is a very simple one-liner, specified by the assignment:

> stderr <- function(x) sqrt(var(x)/length(x))

After this assignment, the standard errors are calculated by

> incster <- tapply(incomes, statef, stderr)

and the values calculated are then

> incster
   act    nsw     nt    qld     sa    tas    vic     wa
   1.5 4.3102    4.5 4.1061 2.7386    0.5  5.244 2.6575

As an exercise you may care to find the usual 95% confidence limits for the state mean incomes. To do this you could use tapply() once more with the length() function to find the sample sizes, and the qt() function to find the percentage points of the appropriate t-distributions.
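One possible approach to this exercise is sketched below, for readers who want to check their own attempt. It assumes the usual two-sided interval, mean plus or minus a t quantile times the standard error, with n - 1 degrees of freedom in each state. The state vector is again an assumption about the previous section's data, and the names n, m, se, tcrit, lower and upper are chosen only for this illustration.

```r
# Assumed data: 'state' is taken to match the previous section's example;
# 'incomes' is copied from the text.
state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
           "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
           "sa",  "act", "nsw", "vic", "vic", "act")
statef <- factor(state)
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)

# Per-state sample size, mean and standard error of the mean.
n  <- tapply(incomes, statef, length)
m  <- tapply(incomes, statef, mean)
se <- tapply(incomes, statef, function(x) sqrt(var(x) / length(x)))

# Upper 2.5% point of the t-distribution with n - 1 degrees of freedom,
# computed separately for each state.
tcrit <- qt(0.975, df = n - 1)

# Two-sided 95% confidence limits for each state mean.
lower <- m - tcrit * se
upper <- m + tcrit * se

# Collect the results in one table, one row per state.
round(cbind(n = c(n), mean = c(m), lower = c(lower), upper = c(upper)), 2)
```

States with only two accountants have a single degree of freedom, so their intervals are very wide even when the standard error is small.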

The function tapply() can also handle more complicated indexing of a vector by multiple categories. For example, we might wish to split the tax accountants by both state and sex. In the simple single-category case used here, however, what happens can be thought of as follows. The values in the vector are collected into groups corresponding to the distinct entries in the category. The function is then applied to each of these groups individually. The value is a vector of function results, labelled by the levels attribute of the category.
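As a minimal sketch of indexing by two categories, the factors can be passed to tapply() together as a list. The data below are hypothetical and are not the accountants' data from the text; in particular the sex factor and all the values are invented for this illustration.

```r
# Hypothetical illustration data (not from the text).
income <- c(50, 60, 55, 65, 40, 70, 45, 52)
state  <- factor(c("nsw", "nsw", "nsw", "vic", "vic", "vic", "vic", "nsw"))
sex    <- factor(c("f",   "m",   "f",   "m",   "f",   "m",   "f",   "m"))

# Passing a list of factors groups the values by every combination of
# levels; the result is a table with one row per state and one column
# per sex.
cellmeans <- tapply(income, list(state, sex), mean)
cellmeans
```

Individual cells can then be picked out by name, for example cellmeans["nsw", "f"]. A combination of levels with no observations appears as NA in the result.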

The combination of a vector and a labelling factor or category is an example of what is called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently, as we see in the next section.
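To make the contrast concrete, the following sketch uses hypothetical balanced data (three groups of four values each, invented for this example). With equal group sizes the values can be stored one group per column, so the column position does the job of the labelling factor. This is only one way of picturing the implicit indexing mentioned above.

```r
# Hypothetical balanced data: 3 groups, 4 observations in each.
values <- c(1, 2, 3, 4,  10, 20, 30, 40,  5, 5, 5, 5)
group  <- factor(rep(c("a", "b", "c"), each = 4))

# Ragged-array style: a vector plus an explicit labelling factor.
ragged_style <- tapply(values, group, mean)

# Equal group sizes: one group per column, so no factor is needed.
m <- matrix(values, nrow = 4, dimnames = list(NULL, levels(group)))
matrix_style <- colMeans(m)

# Both give the same group means.
all.equal(c(ragged_style), matrix_style)
```

The matrix layout only works because every group has the same number of observations; the accountants' data, with between two and six values per state, cannot be stored this way without padding.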



Jeff Banfield
2/13/1998