How best to structure my code in R when doing a lot of data manipulation/cleaning?

Question

I have a general question about writing efficient code as a beginner. I have a very wide dataset (374 obs) on which I have to do several manipulations, mainly using 'mutate' and 'unite'. My question is:

The way I write the code now is that every time I do something new (e.g. if I combine 6 columns into one), I write a separate piece of code for that and create a new dataframe.

Underneath there'll be another piece of code for 'mutate', for example if I have to create a new variable by summing two columns.

Here's an example:

#1B. Combine location columns. 
combinedlocations <- rawdata1 %>% unite(location, locations1,locations2, locations3, na.rm = TRUE, 
remove=TRUE)
combinedlocations <- combinedlocations[-c(6:7)] #drop the unwanted columns

#2. Combine Sector together into one new column: Sector
#B. Combine columns, but override if Type.of.org = 'Independent Artist', where Sector = "Independent Artist"
Combinedsectors <- combinedlocations %>% unite(Sector, Sectors, na.rm=TRUE, remove=TRUE)  %>%

I basically create a new dataframe for each manipulation, using the one I just created.

Is this correct? This is how I learned to do it in SAS. Or is it better to do it all in one dataframe (maybe rawdata2), and is there a way to combine all this code using %>%? (I'm still trying to learn how piping works.)

Ben Bolker · Accepted Answer · 2020-11-17T22:10:50

This is on the edge of "opinion-based", but it's a good question. tl;dr it doesn't matter very much, it's mostly a matter of your preferred style.

Putting everything in one long pipe sequence (a %>% b %>% c %>% d) without intermediate assignments means you don't have as many intermediate objects cluttering your workspace; this means (1) you don't have to come up with names for them all (data1, data2, ...) and (2) you don't use up memory making lots of extra objects (this isn't a problem unless you're working with Big Data).

On the other hand,

Putting everything in one long pipe sequence can make it harder to debug, because it's harder to inspect intermediate results if something goes wrong; this blog post lists a variety of packages/tools that are handy for debugging pipe sequences.
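One lightweight way to inspect intermediate results without breaking up a pipe is a small pass-through function. The sketch below is only an illustration: `peek()` is a hypothetical helper written here, not one of the tools from the blog post, and the built-in `mtcars` data stands in for your own.

```r
library(dplyr)

# Hypothetical helper (not from any package): report the dimensions of the
# data at this point in the pipe, then return the data unchanged so the
# pipe can continue.
peek <- function(df, label = "") {
  message(label, ": ", nrow(df), " rows x ", ncol(df), " cols")
  df
}

result <- mtcars %>%
  filter(cyl == 4) %>%
  peek("after filter") %>%
  mutate(kpl = mpg * 0.425) %>%   # approximate miles-per-gallon to km-per-litre
  peek("after mutate")
```

If a step misbehaves, you can also highlight the pipe only up to that step and run it interactively, or temporarily assign the partial result to a name and look at it with `str()` or `head()`.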

I tend to use piping sequences of about 5-6 lines. Your code in piped format would look something like this ...

#1B. Combine location columns. 
Combinedsectors <- (rawdata1 
   %>% unite(location, locations1,locations2, locations3, 
             na.rm = TRUE, remove=TRUE)
   %>% select(-(6:7)) 
#2. Combine Sector together into one new column: Sector
#B. Combine columns, but override if Type.of.org = 'Independent Artist', where Sector = "Independent Artist"
   %>% unite(Sector, Sectors, na.rm=TRUE, remove=TRUE)  
   %>%  ... <whatever>
)
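For a version you can run end to end, here is a minimal sketch on made-up data. The data frame and all of its column names (`id`, `locations1`, `locations2`, `grant1`, `grant2`) are assumptions for illustration, not your actual columns, and it uses the more common trailing-`%>%` layout rather than the leading-`%>%` style above. It also drops columns by name, which tends to be safer than dropping by position when earlier steps change the column order.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for rawdata1; every column name here is hypothetical.
rawdata1 <- tibble(
  id         = 1:3,
  locations1 = c("Toronto", NA, "Halifax"),
  locations2 = c(NA, "Ottawa", "Moncton"),
  grant1     = c(10, 20, 30),
  grant2     = c(1, 2, 3)
)

cleaned <- rawdata1 %>%
  # Step 1: combine the location columns into one, skipping missing values
  unite(location, locations1, locations2,
        sep = ", ", na.rm = TRUE, remove = TRUE) %>%
  # Step 2: create a new variable by summing two columns
  mutate(total_grant = grant1 + grant2) %>%
  # Step 3: drop the columns that are no longer needed, by name
  select(-grant1, -grant2)

print(cleaned)
```

Each `%>%` passes the result on its left as the first argument of the function on its right, so `rawdata1 %>% unite(...)` is the same as `unite(rawdata1, ...)`. Only the final result gets a name; the raw data stays untouched.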
