2
votes

I would like to create new rows in a data.frame for all missing years for each group (firm and type). The dataframe looks as follows:

minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                  type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                  year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                  value = c(1,3,7,9,9,2,3,3,7,5,9,15)
                  )

Dataframe:

firm type year value
A    X    2000     1
A    X    2004     3
A    X    2007     7
B    X    2010     9
B    X    2008     9
B    X    2001     2
A    Y    2002     3
A    Y    2003     3
A    Y    2007     7
B    Y    2000     5
B    Y    2001     9
B    Y    2008    15

Now, what I want to get is the following: I can see in the data that the minimum year is 2000 and the maximum is 2010. I want to add a row for each missing year for each combination of firm-type. E.g. for firm A and type X, I would like to add rows such that it looks like this:

Final output:

firm type year value
A    X    2000     1
A    X    2004     3
A    X    2007     7
A    X    2001     1
A    X    2002     1
A    X    2003     1
A    X    2005     3
A    X    2006     3
A    X    2008     7
A    X    2009     7
A    X    2010     7

Additionally, I want to write the value from the previous year into the column 'value' for the missing row for all subsequent years until a new non-missing row appears (as seen in the final output example).

I have not yet come up with any useful code, but what I have found so far is the following which might be the right direction:

setDT(minimal)[, .SD[match(2000:2010, year)],
                           by = c("firm","type")]

I don't really understand the concept of setDT and .SD, but this creates at least one row for each firm type combination. However, there is not content for year.

Thanks a lot in advance!

3
I think there are dupes for this. Check for ?complete from tidyr or ?expand.grid from base R or CJ from data.tableakrun
Okay I came up with min2 <- expand.grid(year = min(minimal$year):max(minimal$year), firm = unique(minimal$firm), type = unique(minimal$type)) and merge(min2,minimal, by = c("firm","type","year"), all.x = T). Now I only need to add the correct values to each row which I still don'tk now how to do.lorenzbr
try this: library(dplyr); library(tidyr); minimal %>% group_by(firm, type) %>% complete(year = full_seq(year, 1)) %>% fill(value)Sotos
Cool this is really nice code. However, I still have the problem that it takes the minimum and maximum year for each group(firm,type). I actually need the total minimum and maximum which often differs from the group min and max.lorenzbr
Okay, it is simply year = full_seq(2000:2010,1). Thanks!lorenzbr

3 Answers

4
votes

I could not find an exact dupe for this so here is a possible solution,

library(dplyr)
library(tidyr)

minimal %>% 
  group_by(firm, type) %>% 
  complete(year = full_seq(2000:2010, 1)) %>% 
  fill(value)
0
votes

I wrote this code that do what you wanted, maybe it's not so efficient or elegant but it works:

# Input dataframe
minimal <- data.frame(firm = c("A","A","A","B","B","B","A","A","A","B","B","B"),
                      type = c("X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"),
                      year = c(2000,2004,2007,2010,2008,2001,2002,2003,2007,2000,2001,2008),
                      value = c(1,3,7,9,9,2,3,3,7,5,9,15)
)

# Sorting is needed
minimal = minimal[order(minimal$firm, minimal$type, minimal$year),]

# Variables used
table = table(minimal$firm=="A", minimal$type=="X")
minYear = min(minimal$year)
maxYear = max(minimal$year)
startPos = 0

# Iterates the dataframe
for(i in 1:2){
  for(j in 1:2){
    prevValue = 0
    currYear = minYear

    # Adds minimum year if needed
    if(minimal$year[1+startPos] != currYear){
      newRow = c(as.character(minimal$firm[1+startPos]), as.character(minimal$type[1+startPos]), currYear, prevValue)
      minimal = rbind(minimal, newRow)
    }

    # Adds years
    for(k in (1+startPos):(table[i,j]+startPos)){
      if(minimal$year[k]!=currYear){
        currYear = currYear + 1
        while(minimal$year[k]!=currYear){
          newRow = c(as.character(minimal$firm[k]), as.character(minimal$type[k]), currYear, prevValue)
          minimal = rbind(minimal, newRow)
          currYear = currYear + 1
        }
      }
      prevValue = minimal$value[k]
    }

    # Adds years from last to maximum
    if(currYear < maxYear){
      for(l in 1:(maxYear - currYear)){
        newRow = c(as.character(minimal$firm[k]), as.character(minimal$type[k]), currYear+l, prevValue)
        minimal = rbind(minimal, newRow)
      }
    }
    startPos = startPos + table[i,j]

  }
}

# Result
minimal = minimal[order(minimal$firm, minimal$type, minimal$year),]
minimal
0
votes

Here is a data.table solution.

library(data.table)

dt <- setDT(minimal)[CJ(firm=firm, type=type, year=seq(min(year), max(year)), unique=TRUE),
              on=.(firm, type, year), roll=TRUE]

This returns

head(dt, 15)
    firm type year value
 1:    A    X 2000     1
 2:    A    X 2001     1
 3:    A    X 2002     1
 4:    A    X 2003     1
 5:    A    X 2004     3
 6:    A    X 2005     3
 7:    A    X 2006     3
 8:    A    X 2007     7
 9:    A    X 2008     7
10:    A    X 2009     7
11:    A    X 2010     7
12:    A    Y 2000    NA
13:    A    Y 2001    NA
14:    A    Y 2002     3
15:    A    Y 2003     3

Notice that the initial rows of the second firm-type combo are NA. If you want to fill these in with the subsequent year, you can adjust the argument of fill to "nearest", though this could effect the values in the middle of the data.