2
votes

I was given a spectroscopy reference database that is very different from the sample dataset I'm using. In my data, 0 or 1 indicate if there's a peak or not, while in the reference database the peak position is listed as a row value and grouped as a column in terms of peptide data (which I don't need).

My dataset looks like this:

Sample   1110   1111   1112
1         1        0      0
2         1        0      1
3         0        1      1
4         1        1      1

While the reference database obnoxiously looks like this (note the multiple values per column):

Species     peptide1   peptide2    peptide3
cow           1110        1112         NA
sheep         1111        1112         NA
goat           NA         1113        1114

The desired output would look similar to my dataset:

Species   1110    1111    1112    1113    1114
cow         1        0      1        0      0
sheep       0        1      1        0      0
goat        0        0      0        1      1

These are oversimplified but they illustrate the problem - let me know if I need to provide actual data. I need to both transpose/sort the values in the cells, while replacing the original values in a binary (again, I don't need to keep the peptide column names in the reference database). I'm really hoping there's an easy dplyr or tidyr trick for this - I imagine a spread function could work, but I wouldn't know how to do it for multiple columns nor how to leave the original data. Alternatively, I could manually append all of the data as a long format, then melt/cast it into a wider format?

2
@RonakShah I've updated the question with it!Mulligan

2 Answers

6
votes

I hope this is what you are looking for:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(!Species) %>%
  mutate(val = 1) %>%
  select(-name) %>%
  drop_na() %>%
  arrange(value) %>%
  pivot_wider(names_from = value, values_from = val) %>%
  mutate(across(!Species, ~ replace_na(., 0)))

# A tibble: 3 x 6
  Species `1110` `1111` `1112` `1113` `1114`
  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 cow          1      0      1      0      0
2 sheep        0      1      1      0      0
3 goat         0      0      0      1      1

Data

df <- tribble(
  ~Species,     ~peptide1,   ~peptide2,    ~peptide3,
  "cow",           1110,        1112,         NA,
  "sheep",         1111,        1112,         NA,
  "goat",           NA,         1113,        1114
)

Simplifying the above syntax by using the really useful arguments of pivot_longer() & pivot_wider() as

df %>% 
  pivot_longer(!Species, values_drop_na = TRUE) %>% 
  pivot_wider(id_cols = Species, names_from = value, names_sort = TRUE, values_fill = 0, values_fn = length)

# A tibble: 3 x 6
  Species `1110` `1111` `1112` `1113` `1114`
  <chr>    <int>  <int>  <int>  <int>  <int>
1 cow          1      0      1      0      0
2 sheep        0      1      1      0      0
3 goat         0      0      0      1      1
2
votes

In data.table you can use melt and dcast :

library(data.table)
dcast(melt(setDT(df), 1, na.rm = TRUE), Species~value, fun.aggregate = length)

#    Species 1110 1111 1112 1113 1114
#1:     cow    1    0    1    0    0
#2:    goat    0    0    0    1    1
#3:   sheep    0    1    1    0    0