1
votes

This seems pretty basic, but the number of verbs in the tidyverse is huge now and I don't know which package to look in for this.

Here is the problem. I have a tibble

df <- tibble(f1 = factor(rep(letters[1:3],5)),
             c1 = rnorm(15))

Now if I use the $ operator I can easily find out how many levels are in the factor.

nlevels(df$f1)
# [1] 3

But if I use the [] operator it returns an incorrect number of levels.

nlevels(df[,"f1"])
# [1] 0

Now if df is a data.frame and not a tibble the nlevels() function works with both the $ operator and the [] operator.

So does anyone know the tidyverse equivalent of nlevels() that works on both data.frames and tibbles?

3
Note that iris[,5] is a vector but as_tibble(iris)[,5] still inherits from data.frame. This is why nlevels is failing. Alternatives include: nlevels(df$f1), nlevels(df[,"f1",drop=TRUE]), and nlevels(df[["f1"]]). - r2evans
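The difference described in the comment above can be checked directly. This is a minimal sketch that assumes the tibble package is installed; the variable names base_col and tibble_col are only illustrative.

```r
library(tibble)  # provides as_tibble()

# Base data.frame: selecting a single column with [ , j] drops to a vector.
base_col <- iris[, 5]

# Tibble: the same call keeps a one-column tibble.
tibble_col <- as_tibble(iris)[, 5]

is.factor(base_col)                  # TRUE
inherits(tibble_col, "data.frame")   # TRUE
is.factor(tibble_col)                # FALSE
```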

3 Answers

3
votes

Elaborating on the answer from timcdlucas (and the comments from r2evans), the issue here is how the various forms of the extract operator behave, rather than tibbles being a fundamentally different kind of object. Why? A tibble is actually a kind of data.frame, as illustrated when we use the str() function on a tibble.

> library(dplyr)
> aTibble <- tibble(f1 = factor(rep(letters[1:3],5)),
+              c1 = rnorm(15))
> 
> # illustrate that aTibble is actually a type of data frame
> str(aTibble)
tibble [15 × 2] (S3: tbl_df/tbl/data.frame)
 $ f1: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
 $ c1: num [1:15] -0.5829 0.3682 1.1854 -0.6309 -0.0268 ...

There are four forms of the extract operator in R: [, [[, $, and @, as noted in "What is the meaning of the dollar sign $ in R function?".

The first form, [, can be used to extract content from vectors, lists, matrices, or data frames. When used on a tibble, it returns a tibble unless the drop = TRUE argument is included, as noted in the question comments by r2evans. A base data.frame behaves differently: selecting a single column with df[,"f1"] drops the result to a vector by default, which is why the original code works on a data.frame.

Since the default setting of drop= in the [ method for tibbles is FALSE, df[,"f1"] returns a one-column tibble rather than a factor, which produces the unexpected or "wrong" result for the code posted with the original question.

library(dplyr)
aTibble <- tibble(f1 = factor(rep(letters[1:3],5)),
             c1 = rnorm(15))

# produces unexpected answer
nlevels(aTibble[,"f1"])

> nlevels(aTibble[,"f1"])
[1] 0
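Why 0 rather than an error? As far as I understand it, nlevels(x) is essentially length(levels(x)), and a tibble as a whole has no levels attribute. The sketch below assumes the tibble package is installed and rebuilds the same aTibble; one_col is just an illustrative name.

```r
library(tibble)

aTibble <- tibble(f1 = factor(rep(letters[1:3], 5)),
                  c1 = rnorm(15))

one_col <- aTibble[, "f1"]   # still a tibble with a single column

class(one_col)               # "tbl_df" "tbl" "data.frame"
levels(one_col)              # NULL: the tibble itself carries no levels
length(levels(one_col))      # 0, matching what nlevels() reports
levels(one_col$f1)           # "a" "b" "c": the levels live on the column
```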

The drop = argument is used when extracting from matrices or arrays (i.e. any object that has dimensions), as explained in the help for the drop() function. A tibble has dimensions too:

> dim(aTibble)
[1] 15  2
> 

When we set drop = TRUE, the extract function returns the simplest object available; that is, all extents of length 1 are removed. In the case of the original question, drop = TRUE with the extract operator returns the f1 column as a factor, which is the right type of input for nlevels().

> nlevels(aTibble[,"f1",drop=TRUE])
[1] 3
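The same "remove extents of length 1" idea is easiest to see on a plain matrix, where (unlike with a tibble) dropping is the default. This is a small base R sketch; the matrix m is made up for illustration.

```r
m <- matrix(1:6, nrow = 2)   # a 2 x 3 integer matrix

dim(m[, 1, drop = FALSE])    # 2 1: still a matrix, with one column
dim(m[, 1])                  # NULL: the length-1 extent was dropped
is.vector(m[, 1])            # TRUE: what remains is a plain vector
```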

The [[ and $ forms of the extract operator extract a single object, so they return objects of type factor, the required input to nlevels().

> str(aTibble$f1)
 Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
> nlevels(aTibble$f1)
[1] 3
> 
> # produces expected answer
> str(aTibble[["f1"]])
 Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
> nlevels(aTibble[["f1"]])
[1] 3
> 

The fourth form of the extract operator, @ (known as the slot operator), is used with formally defined objects built with the S4 object system, and is not relevant for this question.

Conclusion: Base R is still relevant when using the Tidyverse

Per tidyverse.org, the tidyverse is a collection of R packages that share an underlying philosophy, grammar, and data structures. When one becomes familiar with the tidyverse family of packages, it's possible to do many things in R without understanding the fundamentals of how Base R works.

That said, when one incorporates Base R functions or functions from packages outside the tidyverse into tidyverse-style code, it's important to know key Base R concepts.

3
votes

I think you might need to use [[ rather than [, e.g.,

> nlevels(df[["f1"]])
[1] 3

2
votes

df[,"f1"] returns a tibble with one column. So you're calling nlevels() on an entire tibble, which doesn't make sense.

library(dplyr)
df %>% pull(f1) %>% nlevels()

gives you what you want.
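If the goal is one call that behaves the same on a data.frame and a tibble, a thin wrapper around [[ is probably enough. The sketch below assumes the tibble package is installed; count_levels() is a hypothetical helper written for this example, not a function from any tidyverse package.

```r
library(tibble)

# Hypothetical helper: count the levels of a single column.
# [[ returns the column itself for both data.frames and tibbles.
count_levels <- function(data, col) {
  nlevels(data[[col]])
}

df_base   <- data.frame(f1 = factor(rep(letters[1:3], 5)),
                        c1 = rnorm(15))
df_tibble <- as_tibble(df_base)

count_levels(df_base, "f1")    # 3
count_levels(df_tibble, "f1")  # 3
```

Using [[ (or pull()) sidesteps the drop question entirely, because neither form ever returns a one-column data frame.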