1
votes

I am rather stuck. I have 2 data frames - df1 has unique station IDs, % values by month, & num of occurrences (years) of that station in the data; df2 has repeated station IDs by year and values per month by year.

df1: represents percentage on non-missing temperature data for each station by month; n represents number of years in that station's record

station_ID  Jan  Feb  Mar ... Dec  n
10160355    37   39   38      39   141
10160360    94   91   98      89   56
10160390    83   87   85      82   163

df2: temperature data by station for each month and year; n from df1 is the length of the repeated station_ID in df2

station_ID  year  Jan  Feb  Mar ... Dec
10160355    1878  NA   10   12      12
10160355    1879  12   12   13      10
...
10160355    2018  14   11   15      14
10160360    1963  12   10   12      14
10160360    1964  10   12   15      11
...
(repeats for all stations & total rows = 277604)

What I need: For each monthly column, if df1$station < 50%, replace the data in df2 with NAs for all rows for that station/month - else, leave df2 as is. So, since df1$station_ID[1] shows only 37% for Jan, all January's for that station (df2$station[1:141]) becomes NA.

Example output I need:

station_ID   year  Jan  Feb ...  Dec
10160355     1878  NA   NA       NA
10160355     1879  NA   NA       NA
...
10160360     1963  12   10       14
10160360     1964  10   12       11
...

I've tried about 20 different methods, but I think I need some form of dplyr with rep to repeat the NAs for rows for each station when the condition is true.

Latest attempt with just one month at a time since I couldn't figure out how to do all cols:

df3 =  df2 %>%
    group_by(station_ID) %>%
    select(Jan) %>%
    mutate(if_else(df1$Jan < 50, rep(NA_character_, df1$n), Jan))

This gives an error for invalid 'times' for rep. I think I might be close, but I appreciate any suggestions! Thanks!

1

1 Answers

0
votes

Things like this are much easier to do in "long" format - especially with dplyr

library(dplyr)
library(tidyr)

df1_long = pivot_longer(df1, cols = Jan:Dec, names_to = "month", values_to = "non_missing")
df2_long = pivot_longer(df2, cols = Jan:Dec, names_to = "month", values_to = "temp")

result_long = df2_long %>%
  left_join(df1_long) %>%
  mutate(temp = ifelse(non_missing < 50, NA, temp))

result_long
# # A tibble: 20 x 6
#    station_ID  year month  temp     n non_missing
#         <int> <int> <chr> <int> <int>       <int>
#  1   10160355  1878 Jan      NA   141          37
#  2   10160355  1878 Feb      NA   141          39
#  3   10160355  1878 Mar      NA   141          38
#  4   10160355  1878 Dec      NA   141          39
#  5   10160355  1879 Jan      NA   141          37
#  6   10160355  1879 Feb      NA   141          39
#  7   10160355  1879 Mar      NA   141          38
#  8   10160355  1879 Dec      NA   141          39
#  9   10160355  2018 Jan      NA   141          37
# 10   10160355  2018 Feb      NA   141          39
# 11   10160355  2018 Mar      NA   141          38
# 12   10160355  2018 Dec      NA   141          39
# 13   10160360  1963 Jan      12    56          94
# 14   10160360  1963 Feb      10    56          91
# 15   10160360  1963 Mar      12    56          98
# 16   10160360  1963 Dec      14    56          89
# 17   10160360  1964 Jan      10    56          94
# 18   10160360  1964 Feb      12    56          91
# 19   10160360  1964 Mar      15    56          98
# 20   10160360  1964 Dec      11    56          89

In many cases (especially making charts, but modeling too), I'd advise you to stick with this long-format data. However, it can be converted back to your original wide format:

result_wide = result_long %>%
  select(-n, -non_missing) %>%
  pivot_wider(names_from = "month", values_from = "temp")
result_wide
# # A tibble: 5 x 6
#   station_ID  year   Jan   Feb   Mar   Dec
#        <int> <int> <int> <int> <int> <int>
# 1   10160355  1878    NA    NA    NA    NA
# 2   10160355  1879    NA    NA    NA    NA
# 3   10160355  2018    NA    NA    NA    NA
# 4   10160360  1963    12    10    12    14
# 5   10160360  1964    10    12    15    11

Using this data:

df1 = read.table(text = 'station_ID  Jan  Feb  Mar  Dec  n
10160355    37   39   38      39   141
10160360    94   91   98      89   56
10160390    83   87   85      82   163', header = T)

df2 = read.table(text = 'station_ID  year  Jan  Feb  Mar  Dec
10160355    1878  NA   10   12      12
10160355    1879  12   12   13      10
10160355    2018  14   11   15      14
10160360    1963  12   10   12      14
10160360    1964  10   12   15      11', header = T)