0
votes

I have data as below:

IDnum   zipcode   City          County   State
10011   36006     Billingsley   Autauga  AL 
10011   36022     Deatsville    Autauga  AL
10011   36051     Marbury       Autauga  AL
10011   36051     Prattville    Autauga  AL
10011   36066     Prattville    Autauga  AL
10011   36067     Verbena       Autauga  AL
10011   36091     Selma         Autauga  AL
10011   36703     Jones         Autauga  AL
10011   36749     Plantersville Autauga  AL
10011   36758     Uriah         Autauga  AL
10011   36480     Atmore        Autauga  AL
10011   36502     Bon Secour    Autauga  AL

I have a list of zipcodes, the cities they encompass, and counties/states they are located in. IDnum = numeric value for county and state, combined. List is in format you see now, I need to reshape it from long to wide / vertical to horizontal, where the IDnum variable becomes the unique identifier, and all other possible value combinations become wide variables.

IDnum  zip1   city1        county1  state1  zip2   city2       county2
10011  36006  Billingsley  Autauga  AL      36022  Deatsville  Autauga

This is just sample of the dataset, it encompasses every zip in the USA and includes more variables. I have seen other questions and answers similar to this one, but not where there are multiple values in almost every column.

There are commands in SPSS and STATA that will reshape data this way, in SPSS I can run a Restructure/Cases to Vars command that turns 11 variables in my initial dataset into about 1750, b/c one county has over 290 zips and it replicates most of the other variables 290+ times. This will create many blanks, but I need it to be reshaped into one very long horizontal file.

I have looked at reshape and reshape2, and am hung up on the 'default to length' error message. I did get melt/dcast to sorta work, but this creates one variable that is a list of all values, rather than creating variables for each value.

melted_dupes <- melt(zip_code_list_dupes, id.vars= c("IDnum"))
HRZ_dupes <- dcast(melted_dupes, IDnum ~ variable, fun.aggregate = list) 

I have tried tidyr and dplyr but got lost in syntax. Am a little surprised there isn't a command the data similar to built in commands in other packages, making me assume there is, and I just haven't figured it out.

Any help is appreciated.

2

2 Answers

1
votes

You can do this with the base function reshape after adding in a consecutive count by IDnum. Assuming your data is stored in a data.frame named df:

df2 <- within(df, count <- ave(rep(1,nrow(df)),df$IDnum,FUN=cumsum)) 

Provides a new column of the consecutive count named "time". And now we can reshape to wide format

reshape(df2,direction="wide",idvar="IDnum",timevar="count") 
  IDnum zipcode.1      City.1 County.1 State.1 zipcode.2     City.2 County.2 State.2 zipcode.3  City.3 County.3 State.3 zipcode.4     City.4 County.4 State.4
1 10011     36006 Billingsley  Autauga      AL     36022 Deatsville  Autauga      AL     36051 Marbury  Autauga      AL     36051 Prattville  Autauga      AL

(output truncated, goes all the way to zipcode.12, etc.)

0
votes

There might be a more efficient way, but try the following. I used my own (example) dataset, very similar to yours. Run the process step by step to see how it works, as you'll have to modify some things in the code.

library(dplyr)
library(tidyr)

# get example data
dt = data.frame(id = c(1,1,1,2,2),
zipcode = c(4,5,6,7,8),
city = c("A","B","C","A","C"),
county = c("A","B","C","A","C"),
state = c("A","B","C","A","C"))

dt

#   id zipcode city county state
# 1  1       4    A      A     A
# 2  1       5    B      B     B
# 3  1       6    C      C     C
# 4  2       7    A      A     A
# 5  2       8    C      C     C


# get maximum number of rows for a single id
# this will help you get the wide format
max_num_rows = max((dt %>% count(id))$n)

# get names of columns to reshape
col_names = names(dt)[-1]

dt %>% 
group_by(id) %>% 
mutate(nrow = paste0("row",row_number())) %>%
unite_("V",col_names) %>%
spread(nrow, V) %>%
unite("z",matches("row")) %>%
separate(z, paste0(col_names, sort(rep(1:max_num_rows, ncol(dt)-1))), convert=T) %>%
ungroup()

# # A tibble: 2 × 13
#      id zipcode1 city1 county1 state1 zipcode2 city2 county2 state2 zipcode3 city3 county3 state3
# * <dbl>    <int> <chr>   <chr>  <chr>    <int> <chr>   <chr>  <chr>    <int> <chr>   <chr>  <chr>
# 1     1        4     A       A      A        5     B       B      B        6     C       C      C
# 2     2        7     A       A      A        8     C       C      C       NA  <NA>    <NA>   <NA>