41
votes

I've looked through a number of posts about ordering factors, but haven't quite found a match for my problem. Unfortunately, my knowledge of R is still pretty rudimentary.

I have a subset of an archaeological artifact catalog that I'm working with. I'm trying to cross-tabulate diagnostic historical artifact types and site testing locations. Easy enough with ddply or tapply.

My problem is that I want to sort the artifact types (a factor) by their mean diagnostic date (number/year), and I keep getting them alphabetically. I know I need to make it an ordered factor, but can't figure out how to order it by the year value in the other column.

IDENTIFY                                      MIDDATE
engine-turned fine red stoneware              1769
white salt-glazed stoneware, scratch blue     1760
wrought nail, 'L' head                        1760
yellow lead-glazed buff earthenware           1732
...

Which needs to be ordered:

IDENTIFY                                      MIDDATE
yellow lead-glazed buff earthenware           1732
white salt-glazed stoneware, scratch blue     1760
wrought nail, 'L' head                        1760
engine-turned fine red stoneware              1769
...

The factor (IDENTIFY) needs to be ordered by the Date (MIDDATE). I thought I had it with

Catalog$IDENTIFY<-factor(Catalog$IDENTIFY,levels=Catalog$MIDDATE,ordered=TRUE)

But get the warning:

In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
else paste0(labels,: duplicated levels will not be allowed 
in factors anymore

IDENTIFY has ~130 factor levels and many have the same value for MIDDATE, so I need to order IDENTIFY by MIDDATE and another column TYPENAME.

A little more detail:

I have a dataframe Catalog, which breaks down (i.e. str(Catalog)) as:

> str(Catalog)
'data.frame':   2211 obs. of  15 variables:
 $ TRENCH  : Factor w/ 7 levels "DRT 1","DRT 2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ U_TYPE  : Factor w/ 3 levels "EU","INC","STP": 1 1 1 1 1 1 1 1 1 1 ...
 $ U_LBL   : Factor w/ 165 levels "001","005","007",..: 72 72 72 72 72 72 ...
 $ L_STRAT : Factor w/ 217 levels "#2-7/25","[3]",..: 4 4 4 4 4 4 89 89 89 89 ...
 $ START   : num  0 0 0 0 0 0 39.4 39.4 39.4 39.4 ...
 $ END     : num  39.4 39.4 39.4 39.4 39.4 39.4 43.2 43.2 43.2 43.2 ...
 $ Qty     : int  1 1 3 5 1 1 6 8 1 1 ...
 $ MATNAME : Factor w/ 6 levels "Ceramics","Chipped Stone",..: 1 1 1 5 5 6 ...
 $ TYPENAME: Factor w/ 9 levels "Architectural Hardware",..: 9 9 9 1 1 3 9 ...
 $ CATNAME : Factor w/ 32 levels "Biface","Bottle Glass",..: 24 29 29 6 24 ...
 $ IDENTIFY: Factor w/ 112 levels "amethyst bottle glass",..: 17 91 96 71 103 ...
 $ BEGDATE : int  1820 1820 1830 1835 1700 NA 1670 1762 1800 1720 ...
 $ ENDDATE : int  1900 1970 1860 1875 1820 NA 1795 1820 1820 1780 ...
 $ OCC_LBL : Ord.factor w/ 5 levels "Late 19th Century"<..: 2 1 2 2 4 5 4 3 ...
 $ MIDDATE : num  1860 1895 1845 1855 1760 ...

I need to make IDENTIFY an ordered factor, and reorder by MIDDATE -> TYPENAME -> alpha by IDENTIFY.

What I'm not really getting is how to reorder by combined orders from multiple columns.

I would just do this in the db, but a lot of what I'm running are weighted means within all sorts of cross-tabulations (e.g. weighted mean depths below ground surface for artifact classes by location)...

...doable in Access, but messy, and unpredictable. Far easier and cleaner to manage in R, but I'd rather not to have to manually re-sort the resulting tables.

What I'm trying to produce is a number of things along these lines:

>xtab.Catalog<-tapply(Catalog$Qty,list(Catalog$IDENTIFY,Catalog$TRENCH),sum)

IDENTIFY                        DRT1    DRT2    DRT3    DRT4    DRT5    DRT6
Staffordshire stoneware         4       NA      NA      NA      NA      NA  
undecorated delftware           6       4       NA      NA      NA      NA  
unidentified wrought nail       15      9       3       1       3       NA  
white salt-glazed stoneware     6       1       1       NA      2       1   
white salt-glazed scratch blue  1       NA      NA      NA      NA      NA  
white stoneware, slip-dipped    NA      NA      NA      NA      NA      NA  
wrought nail, 'L' head          2       NA      NA      NA      NA      NA  
wrought nail, 'rose' head       62      21      4       NA      1       1   
wrought nail, 'T' head          2       NA      1       NA      NA      1   
yellow lead-glazed              12      NA      NA      NA      1       3   
...

...but I need them to sort in the logical (i.e. chronological/type) order instead of alphabetical.

4
Have a look at the function reorder() in the stats package (i.e. base R).Andrie
Tried reorder(Catalog$IDENTIFY,Catalog$MIDDATE) but got the same warning about duplicate levels and IDENTIFY all turned to NA's...Scard
It's a bit hard to help you without reproducible code. Can you try and make a reproducible example, and I shall have a look?Andrie
Just tacked up more detail on what I'm doing... apologies for my being only semi-literate in R. Had been using SPSS for these types of things, but our office stopped updating it (stuck with a buggy v11).Scard
A minimal reproducible example contains code anyone can run on their machine and get the same error / warning. See stackoverflow.com/q/5963269/602276 for more detail. Anyway, you are getting a warning, not an error, so evaluate your results to see whether it isn't in fact ok as it is.Andrie

4 Answers

45
votes

Here's a reproducible sample, with solution:

set.seed(0)
a = sample(1:20,replace=F)
b = sample(1:20,replace=F)
f = as.factor(letters[1:20])

> a
 [1] 18  6  7 10 15  4 13 14  8 20  1  2  9  5  3 16 12 19 11 17
> b
 [1] 16 18  4 12  3  5  6  1 15 10 19 17  9 11  2  8 20  7 13 14
> f
 [1] a b c d e f g h i j k l m n o p q r s t
Levels: a b c d e f g h i j k l m n o p q r s t

Now for the new factor:

fn = factor(f, levels=unique(f[order(a,b,f)]), ordered=TRUE)

> fn
 [1] a b c d e f g h i j k l m n o p q r s t
20 Levels: k < l < o < f < n < b < c < i < m < d < s < q < g < h < e < ... < j

Sorted on 'a', next 'b' and finally 'f' itself (although in this example, 'a' has no repeated values).

28
votes

I recommend the following dplyr-based approach (h/t daattali) that can be extended to as many columns as you like:

library(dplyr)
Catalog <- Catalog %>%
  arrange(MIDDATE, TYPENAME) %>%               # sort your dataframe
  mutate(IDENTIFY = factor(IDENTIFY, unique(IDENTIFY))) # reset your factor-column based on that order
4
votes

The function fct_reorder2 is doing just that.

Please note the subtlety that fct_reorder is sorting by ascending order while fct_reordering2 is sorting by descending order.

Code from the documentation:

df0 <- tibble::tribble(
  ~color,     ~a, ~b,
  "blue",      1,  2,
  "green",     6,  2,
  "purple",    3,  3,
  "red",       2,  3,
  "yellow",    5,  1

)

df0$color <- factor(df0$color)
fct_reorder(df0$color, df0$a, min)
 #> [1] blue   green  purple red    yellow
 #> Levels: blue red purple yellow green
fct_reorder2(df0$color, df0$a, df0$b)
1
votes

data.table version. As usual a one-liner

library(data.table)
Catalog = data.table(Catalog)
Data[, IDENTIFY := reorder(IDENTIFY, MIDDATE)]

Note: IDENTIFY needs to be a factor already. The main power of this solution comes from the stats::reorder function.