Create duplicated rows by filling groups in a data.frame

Question

Here my example data.frame:

df = read.table(text = 'ID Day Count Count_group
                18  1933    6   15
                33  1933    6   15
                37  1933    6   15
                18  1933    6   15
                16  1933    6   15
                11  1933    6   15
                111 1932    5   9
                34  1932    5   9
                60  1932    5   9
                88  1932    5   9
                18  1932    5   9
                33  1931    3   4
                13  1931    3   4
                56  1931    3   4
                23  1930    1   1
                6   1800    6   12
                37  1800    6   12
                98  1800    6   12
                52  1800    6   12
                18  1800    6   12
                76  1800    6   12
                55  1799    4   6
                6   1799    4   6
                52  1799    4   6
                133 1799    4   6
                112 1798    2   2
                677 1798    2   2
                778 888     4   8
                111 888     4   8
                88  888     4   8
                10  888     4   8
                37  887     2   4
                26  887     2   4
                8   886     1   2
                56  885     1   1
                22  120     2   6
                34  120     2   6
                88  119     1   6
                99  118     2   5
                12  118     2   5
                90  117     1   3
                22  115     2   2
                99  115     2   2', header = TRUE)

Count column shows the number of ID observations within a Day; Count_group shows the number of ID observations within a Day and its previous 4 days.

I need to expand df in order to have all the days within each Count_group episodes.

Expected output:

ID  Day Count   Count_group
18  1933    6   15
33  1933    6   15
37  1933    6   15
18  1933    6   15
16  1933    6   15
11  1933    6   15
111 1932    5   15
34  1932    5   15
60  1932    5   15
88  1932    5   15
18  1932    5   15
33  1931    3   15
13  1931    3   15
56  1931    3   15
23  1930    1   15
6   1800    6   12
37  1800    6   12
98  1800    6   12
52  1800    6   12
18  1800    6   12
76  1800    6   12
55  1799    4   12
6   1799    4   12
52  1799    4   12
133 1799    4   12
112 1798    2   12
677 1798    2   12
111 1932    5   9
34  1932    5   9
60  1932    5   9
88  1932    5   9
18  1932    5   9
33  1931    3   9
13  1931    3   9
56  1931    3   9
23  1930    1   9
778 888    4    8
111 888    4    8
88  888    4    8
10  888    4    8
37  887    2    8
26  887    2    8
8   886    1    8
56  885    1    8
55  1799   4    6
6   1799   4    6
52  1799   4    6
133 1799   4    6
112 1798   2    6
677 1798   2    6
22  120    2    6
34  120    2    6
88  119    1    6
88  119    1    6
99  118    2    6
12  118    2    6
99  118    2    6
12  118    2    6
90  117    1    6
90  117    1    6
22  115    2    6
99  115    2    6
99  118    2    5
12  118    2    5
90  117    1    5
22  115    2    5
99  115    2    5
33  1931   3    4
13  1931   3    4
56  1931   3    4
23  1930   1    4
37  887    2    4
26  887    2    4
8   886    1    4
56  885    1    4
90  117    1    3
22  115    2    3
99  115    2    3
112 1798   2    2
677 1798   2    2
8   886    1    2
56  885    1    2
22  115    2    2
99  115    2    2
23  1930   1    1
56  885    1    1

Explanation of output:

1) Day 1933 has got 6 IDs on this precise day (Count col) and 15 IDs in total from Day 1933 to Day 1929 (Count_group col). The value 15 comes from 6 (Day 1933) + 5 (1932) + 3 (1931) + 1 (1930) + 0 (1929). So in the output I added all the remaining days within Count_group = 15 episode.

2) Next Day in descending order is 1932. With 5 IDs on this precise day and 9 IDs in total from Day 1932 to Day 1928. The value 9 comes from 5 (1932) + 3 (1931) + 1 (1930) + 0 (1929) + 0 (1928). And in the output (row 28) you will see the Day 1932 complete (5-day) episode with 9 rows in total.

3) Next Day is 1931..etc, etc..

The output data.frame is ranked by Count_group and Day, both decreasing = TRUE.

I am trying to create a code that works not only for a 5-day window (as above) but for any time window of n days.

Do you have any suggestion?

Thanks

I don't fully understand how you got from the data to expected output, but you probably could use tidyr::complete(). Maybe see this question, or this one. — austensen
I'm a little confused. How do we create new rows for you. What are the rules plain and simple? Write down the process that you cannot figure out how to code. How do we calculate to new values in each of these columns to help you? Please post the response as an edit to your question. — Evan Friedland
info added. actually there are no new data to create. I just need duplicated values per each Count_group episode which need to be grouped accordingly to the Count_group episode. thanks — aaaaa

Evan Friedland Evan Friedland · Accepted Answer · 2017-06-02T17:53:58

Try this out and tell me if this is what you were thinking:

# First I split the dataframe by each day using split()
duplicates <- lapply(split(df, df$Day), function(x){
  if(nrow(x) != x[1,"Count_group"]) { # check if # of rows != the number you want
    x[rep(1:nrow(x), length.out = x[1,"Count_group"]),] # repeat them until you get it
  } else {
    x
  }
})

df2 <- do.call("rbind.data.frame", duplicates) # turn the list back into a dataframe
df3 <- df2[order(df2[,"Count_group"], df2[,"Day"], decreasing = T), ] # orderby Day & count
rownames(df3) <- NULL # names back to 1:X instead of the generated ones
df3 # the result

Create duplicated rows by filling groups in a data.frame

1 Answers