3
votes

I have data in this form

id  year    facname class_code       line_no    value
1     1         A        County           1      county1
1     1         A        County           2      county2
1     1         A        source1          1      9
1     1         A        source1          2      4
1     1         A        source2          1      7
1     1         A        source2          2      2
1     1         A        source3          1      8...

2     1         B        County           1      county1
2     1         B        County           2      county1
2     1         B        source1          1      21
2     1         B        source1          2      9
2     1         B        source2          1      4
2     1         B        source2          2      7 ....

I am trying to convert this into something like this : (note the last 3 columns will have the values 'spread' accordingly)

id year facname   line_no        County      source1        source2      source3
1   1    A       1               county1      9                7           8      
1   2    A       2               county2      4                2           NA
1   3    A       3               county3             
1   4    A       4               county4
2   1    B       1               county1
2   2    B       2               county2
2   3    B       3               county3
2   4    B       4               county4 

This will show the different number of payers (source1,source2, source3) and the county names (county1, county2) that the counties belong to. I know its some combination of spread (and possibly gather) but I'm unable to put my head around it.

Any help is appreciated, thanks!
(PS: I understand this is probably a duplicate question but I'm really new to tidying data)

edit: The counties (county1,2..etc.) are actually numbers (in the original dataset) but are categorical in nature, so I'm calling them as county1, the other values (sources) are actually number of people in that county participating in events (source1, source2 etc.). There are a total of 40 line_no's for each facility.

2
What is the logic you think to decide county? Its not very clear from data.MKR
@MKR Basically in this case the counties (county1, county 2, etc.) are numerical county codes (categorical). For simplicity, I used county1, county2 instead of their respective county codes ("01", "20"). While the others are numerical (2 persons in source1, 3 persons in source2, etc.) Since the value column (original dataset) is numeric in nature, I'm unable to form the logic as to how I should use spread. I made a mistake with copying the line numbers initially, I've corrected it now. Apologies for the confusionAnurag Kaushik
Thats fine. But for few rows you have mentioned county1 and for others you have mentioned a number. For consistency, can we not use conuty1, county2, county3 etc. Is that the way your data will be?MKR
The 'county3', 'county4' are not in the input dataakrun
@MKR yeah, they're all not counties. There are a total of 40 line_numbers for each facname (facility Name) of which there are only some payers (persons in source1, source2 and source3) Instead of displaying all 40, I thought this might be convenient. Sorry I should have added a "..." between facilities A and B to denote that there are a whole lot of line codes and respective sources in between. Will update the thread!Anurag Kaushik

2 Answers

2
votes

An option is to use double tidyr::spread as:

UPDATED: based on comments from @CJYetman

library(dplyr)
library(tidyr)

# Just spread can transform and work on present sample data used by OP
df %>% spread(class_code, value)

#The complicated version below based was initially used to handle different
#line numbers for rows with "County" and rows without "County"
filter(df, class_code == "County") %>% spread(class_code, value) %>% 
left_join(filter(df, class_code != "County") %>% spread(class_code, value),
  by=c("id", "line_no", "facname")) 

#   id facname line_no  County source1 source2 source3
# 1  1       A       1 county1       9       7       8
# 2  1       A       2 county2       4       2    <NA>
# 3  2       B       1 county1      21       4    <NA>
# 4  2       B       2 county1       9       7    <NA>

Data:

df <- read.table(text = 
"id  facname class_code       line_no    value
1   A        County           1      county1
1   A        County           2      county2
1   A        source1          1      9
1   A        source1          2      4
1   A        source2          1      7
1   A        source2          2      2
1   A        source3          1      8
2   B        County           1      county1
2   B        County           2      county1
2   B        source1          1      21
2   B        source1          2      9
2   B        source2          1      4
2   B        source2          2      7",
header = TRUE, stringsAsFactors = FALSE)
2
votes

We can use dcast from data.table

library(data.table)
dcast(setDT(df1), id + facname + rowid(class_code) ~ class_code, value.var = 'value')
#     id facname class_code  County source1 source2 source3
#1:  1       A          1 county1       9       7       8
#2:  1       A          2 county2       4       2      NA
#3:  2       B          3 county1      21       4      NA
#4:  2       B          4 county1       9       7      NA

If we need the 8 rows as in the expected output

dcast(setDT(df1), id + facname + rowid(class_code) ~ class_code, 
    value.var = 'value', drop = FALSE)[ ,.SD[!all(is.na(County))], .(id, facname)]
#   id facname class_code  County source1 source2 source3
#1:  1       A          1 county1       9       7       8
#2:  1       A          2 county2       4       2      NA
#3:  1       A          3      NA      NA      NA      NA
#4:  1       A          4      NA      NA      NA      NA
#5:  2       B          1      NA      NA      NA      NA
#6:  2       B          2      NA      NA      NA      NA
#7:  2       B          3 county1      21       4      NA
#8:  2       B          4 county1       9       7      NA