I performed CV calculation over numeric columns of a data frame grouped by a specific column values. Then based on resulting dataframe, I replaced the related column values in first data frame. I did the operation by mixing dplyr functions, merge function, and for loop with if statement. I would like to ask you for a shorter operations of tidyverse functions if any. Here are the steps and example outcome.
I had a data frame- A
Accession Column1 Column2 Column3 Root ID
2000_1 0 0.2 14 2000 1
2000_2 0.01 0.2 17 2000 2
2001_1 0.012 0.22 11 2001 1
2001_2 0.011 0.231 17 2001 2
For this demo I have three numerical columns Column1, Column2, and Column3, but in actual data it is > 500. I subset the data frame to keep only numeric columns (Column1, Column2, Column3, and Root) except ID column. Then I grouped the rows based on column "Root" and calculated the coefficient of variance as a second data frame by using aggregate function/also using tibble functions groupby/summarize_if/ungroup. I replaced the values with NA if they are greater than 30. Here is the resulting dataframe- B
Column1 Column2 Column3 Root
NA 0 13.68 2000
6.14 3.44 NA 2001
I merged data frame A and B by using merge function based on Root column. Resulting data frame was
Column1.x Column2.x Column3.x Accession Column1.y Column2.y Column3.y Root ID
NA 0 13.68 2000_1 0 0.2 14 2000 1
NA 0 13.68 2000_2 0.01 0.2 17 2000 2
6.14 3.44 NA 2001_1 0.012 0.22 11 2001 1
6.14 3.44 NA 2001_2 0.011 0.231 17 2001 2
Then I looped through numeric columns (Column 1, 2, 3). For Root column value 2000, I replaced Column1.y values (0 and 0.01) with Column1.x values are NA. For Root column value 2001, I didn't replace Column1.y values are not NA.
Resulting data frame was
Accession Column1.y Column2.y Column3.y Root ID
2000_1 NA 0.2 14 2000 1
2000_2 NA 0.2 17 2000 2
2001_1 0.012 0.22 NA 2001 1
2001_2 0.011 0.231 NA 2001 2
Column1.y
column from the two columnsColumn1.x
andColumn1.y
and same forColumn2.y
andColumn2.x
? – Ronak Shah