gathering the iris data set. in R

Question

I want to modify the iris data set in R. It has 5 variables: species, sepal.length, sepal.width, petal.length and petal.width. I need a new column called part which specifies if it's the sepal or the petal and length and width columns which show the measurements. the result should look like this:

I also need make another one where there is a column called measure which indicates the type of measurement (length or width) and shows with the species as variables. It should look like this:

How do I do this using tidyr?

Welcome to SO, Ahmad. What solutions have you tried thus far? — Len Greski

Len Greski Len Greski · Accepted Answer · 2020-03-29T18:53:16

Here is an approach where we make a narrow format tidy data set to start, and then use pivot_wider() to get the result with Length and Width columns.

library(tidyr)
# add an ID variable so we can pivot_wider and match measurement for correct observations
id <- 1:nrow(iris)
data <- cbind(id,iris)
data %>%  gather(.,key = "part.measurement",value = "value",-id,-Species) %>%
     separate(.,part.measurement,c("part","measurement")) -> narrow_data
 head(narrow_data[2:5])

> head(narrow_data[2:5])
  Species  part measurement value
1  setosa Sepal      Length   5.1
2  setosa Sepal      Length   4.9
3  setosa Sepal      Length   4.7
4  setosa Sepal      Length   4.6
5  setosa Sepal      Length   5.0
6  setosa Sepal      Length   5.4

At this point we can use pivot_wider() to create the Length and Width columns. We'll add an arrange() so the sort order matches the image posted with the question.

narrow_data %>% pivot_wider(.,names_from = measurement,values_from = value) %>%
     arrange(Species,part)-> wide_data
head(wide_data[2:5])

...and the output:

> head(wide_data[2:5])
# A tibble: 6 x 4
  Species part  Length Width
  <fct>   <chr>  <dbl> <dbl>
1 setosa  Petal    1.4   0.2
2 setosa  Petal    1.4   0.2
3 setosa  Petal    1.3   0.2
4 setosa  Petal    1.5   0.2
5 setosa  Petal    1.4   0.2
6 setosa  Petal    1.7   0.4
>

The second output is tricky because it essentially merges the 200 observations of part and measurement for each species of flower into an output tibble of 200 rows, one for each combination of part and length for each of the 50 observations of each Species.

#  reproduce 2nd output
speciesId <- c(1:200,1:200,1:200) # unique obs within species
narrow_species_data <- cbind(speciesId,narrow_data[order(narrow_data[,1],narrow_data[,3],narrow_data[,4]),c(2:5)])

narrow_species_data %>% pivot_wider(.,names_from= Species,values_from = value) %>%
     arrange(part,measurement,speciesId) -> wide_data_species
head(wide_data_species[2:6])

...and the output:

>  head(wide_data_species[2:6])
# A tibble: 6 x 5
  part  measurement setosa versicolor virginica
  <chr> <chr>        <dbl>      <dbl>     <dbl>
1 Petal Length         1.4        4.7       6  
2 Petal Length         1.4        4.5       5.1
3 Petal Length         1.3        4.9       5.9
4 Petal Length         1.5        4         5.6
5 Petal Length         1.4        4.6       5.8
6 Petal Length         1.7        4.5       6.6
>

A "completely tidyverse" version

Here is a version of both parts of the question that solely uses features from the tidyverse family of packages.

For the first question, we use mutate() and seq_along() to create unique sequential numbers to identify each observation in the original data. We create a narrow form tidy data set with gather(), and then convert it into the desired output with pivot_wider(). To match the order of observations from the image in the original question, we arrange(Species,part).

library(tidyr)
library(dplyr)
# add an ID variable so we can pivot_wider and match measurement for correct observations
iris %>% mutate(id = seq_along(Species)) %>%  gather(.,key = "part.measurement",value = "value",-id,-Species) %>%
     separate(.,part.measurement,c("part","measurement")) -> narrow_data

narrow_data %>% pivot_wider(.,names_from = measurement,values_from = value) %>% 
     arrange(Species,part) -> wide_data
head(wide_data[2:5])

...and the output:

> head(wide_data[2:5])
# A tibble: 6 x 4
     id part  Length Width
  <int> <chr>  <dbl> <dbl>
1     1 Petal    1.4   0.2
2     2 Petal    1.4   0.2
3     3 Petal    1.3   0.2
4     4 Petal    1.5   0.2
5     5 Petal    1.4   0.2
6     6 Petal    1.7   0.4
>

For the second question, instead of building a vector of sequential IDs for each species and using cbind() it to the rest of the data, we can use dplyr functions to create the sequences within a pipeline.

We use arrange() to sort the data by Species, id, part, and measurement. Then we group_by(Species) so we can use mutate() to create a unique sequential ID with seq_along(). The sort order is important because we want to merge the 1st observation with the 51st observation and the 101st observation.

Then we ungroup() to clear the group_by() and use pivot_wider() with id_cols = speciesId to create the desired output.

narrow_data %>% arrange(Species,id,part,measurement) %>% group_by(Species) %>% mutate(speciesId = seq_along(Species)) %>%
   ungroup(.) %>% pivot_wider(.,id_cols=c("speciesId","part","measurement"),names_from= Species,values_from = value) %>%
     arrange(part,measurement,speciesId) -> wide_data_species
head(wide_data_species[2:6])

...and the output:

> head(wide_data_species[2:6])
# A tibble: 6 x 5
  part  measurement setosa versicolor virginica
  <chr> <chr>        <dbl>      <dbl>     <dbl>
1 Petal Length         1.4        4.7       6  
2 Petal Length         1.4        4.5       5.1
3 Petal Length         1.3        4.9       5.9
4 Petal Length         1.5        4         5.6
5 Petal Length         1.4        4.6       5.8
6 Petal Length         1.7        4.5       6.6
>

gathering the iris data set. in R

3 Answers

A "completely tidyverse" version