0
votes

I am trying to prepare my variables to use the data in a regression analysis. I get an error when I create the following data table. I need to prepare the data to display the number of times a member participates in the debate (n_Edu) per year and include the other relevant variables alongside. All of the variables seem to be fine, except for the days_in_house one. Here is my code.

library(data.table)

df1 <- data.table(df1)

mp_by_year <- df1[, list(n_parent_Edu = sum(parent_Edu),
                         isFemale = unique(isFemale),
                         party = unique(party),
                         days_in_house = unique(days_in_house)),
                  by = list(member_id, year)]

When I run this code without the days_in_house variable (i.e. with just the isFemale, parent_Edu, member_id, year and party variables), it works fine and produces a new data table. However, when I add this variable, it gives me the error below. The variable looks like this:

days_in_house
1647
6383
463
3528
462
3639
16
1738
16
187
3732

...and so on. I get the following error when I add in this variable to the data table:

"Supplied 2 items for column 3 of group 242 which has 5 rows. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code."

My other variables appear as follows:

isFemale

0
1
0
0
0
0
1

party

Conervative
Labour
Liberal Democrats
Conservative
Conervative
Labour

membership_id

463
283
352
287
27
372

year

1997
1997
1997
1997
1997

It's hard to say for sure without seeing reproducible data (see stackoverflow.com/questions/5963269/…), but I think the issue is that for that group, one of the column definitions produces 5 rows, while column 3 produces 2 (either 1 or 5 would be okay). So it isn't necessarily the days_in_house variable causing the problem. I would verify that each use of unique is returning one value (or the expected number of values) per (member_id, year), something like df1[, uniqueN(isFemale), by = list(member_id, year)][V1 != <expected_value>] – smingerson

1 Answer

2
votes

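The issue is that unique can return a different number of values for each column within a group, while every expression in j must return either a single value or the same number of values as the others. In your error, column 3 is the third expression in j, unique(party), which returned 2 values for group 242 where 1 or 5 were expected. Here is a simple reprex for the error: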

library(data.table)

dt = data.table(grp = 1L,
           party = c("A", "A", "B"),
           days = 1:3,
           val = rnorm(3L))

dt
#>      grp  party  days       val
#>    <int> <char> <int>     <num>
#> 1:     1      A     1 -0.946899
#> 2:     1      A     2 -2.094639
#> 3:     1      B     3  1.033007

dt[ ,
   .(sum(val), unique(party), unique(days)),
   by = grp
   ]
#> Error in `[.data.table`(dt, , .(sum(val), unique(party), unique(days)), : Supplied 2 items for column 2 of group 1 which has 3 rows. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.

The error occurs because unique(party) returns 2 values, whereas the group has 3 rows; only a length of 1 or 3 is accepted here. It sounds like what you actually want is to group by more columns, so that every distinct combination gets its own row:

dt[,
   .(sum(val)),
   by = .(grp, party, days)]
#>      grp  party  days          V1
#>    <int> <char> <int>       <num>
#> 1:     1      A     1  0.87004621
#> 2:     1      A     2 -2.36972622
#> 3:     1      B     3  0.05793804

For your dataset, you would use:

df1[ , 
    .(n_parent_Edu = sum(parent_Edu)), 
    by = .(member_id, year, isFemale, party, days_in_house)]
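If you need exactly one row per group instead (for example one row per member and year for a regression), another option is to make every expression in j return a single value. The sketch below reuses the toy table from the reprex, with fixed numbers in place of rnorm() so the result is reproducible. Using uniqueN and max as the summaries is only an assumption for illustration; which summary is right for days_in_house depends on what that variable means in your data.

```r
library(data.table)

# Toy data from the reprex, with fixed values instead of rnorm()
dt <- data.table(grp = 1L,
                 party = c("A", "A", "B"),
                 days = 1:3,
                 val = c(0.5, -1.0, 2.0))

# Each expression in j returns exactly one value per group,
# so the lengths always agree and no recycling is needed.
res <- dt[,
          .(total = sum(val),
            n_party = uniqueN(party),  # number of distinct parties in the group
            max_days = max(days)),     # one possible single-value summary of days
          by = grp]
print(res)
```

This keeps one row per grp, at the cost of deciding explicitly how each varying column is collapsed.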

For future questions, it is nice to simplify a dataset as I did above. Or, worst case, you can use dput(head(df1, 10L)) or modify the dataset in order to reproduce the problem.
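To find which groups actually trigger the error, one approach is to count the distinct values of each column per (member_id, year). The table below is invented purely for illustration (it is not your data); it includes a misspelled party name to show how a typo alone can give a group two "unique" values.

```r
library(data.table)

# Hypothetical stand-in for df1; all values are made up for illustration.
toy <- data.table(
  member_id     = c(1L, 1L, 1L, 2L, 2L),
  year          = c(1997L, 1997L, 1997L, 1997L, 1997L),
  isFemale      = c(0L, 0L, 0L, 1L, 1L),
  party         = c("Conservative", "Conervative", "Conservative", "Labour", "Labour"),
  days_in_house = c(10L, 11L, 12L, 463L, 463L),
  parent_Edu    = c(1L, 0L, 1L, 1L, 1L)
)

# Count distinct values of each candidate column within every (member_id, year) group.
check_cols <- c("isFemale", "party", "days_in_house")
n_distinct <- toy[, lapply(.SD, uniqueN), by = .(member_id, year), .SDcols = check_cols]

# Keep only the groups where at least one column has more than one distinct value.
problem_groups <- n_distinct[isFemale > 1L | party > 1L | days_in_house > 1L]
print(problem_groups)
```

Any group listed in problem_groups is one where unique() would return more than one value for at least one column, so those are the rows worth inspecting in the original data.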