1
votes

I have some data which looks like:

# A tibble: 6 x 3
  Time    Date       Weather             
  <chr>   <date>     <chr>               
1 "7:00 " 2010-01-01 Passing clouds      
2 "7:30 " 2010-01-01 Passing clouds      
3 "8:00 " 2010-01-01 Passing clouds      
4 "8:30 " 2010-01-01 Passing clouds      
5 "9:00 " 2010-01-01 Partly sunny        
6 "9:30 " 2010-01-01 Drizzle Partly sunny

Which has hourly data for each day. I am trying to collapse it down into a daily series and create some dummy variables but not for each 30 minute interval.

That is, when I create dummy variables currently it creates too many columns. Which is why I am trying to collapse it based on some condition. The condition is if the Weather has 4 consecutive observations the same then keep it. i.e. Passing clouds has 4 consecutive Weather conditions but Partly sunny does not, the same for Drizzle Partly sunny.

I currently have the following:

library(splitstackshape)
df %>% 
  group_by(Date) %>% 
  arrange(Weather) %>% 
  distinct(Weather) %>% 
  summarise(text = paste(Weather, collapse = "_")) %>% 
  cSplit_e(., split.col = "text", sep = "_", type = "character",
           mode = "binary", fixed = TRUE, fill = 0)

But this does it for all unique observations in the Weather column which gives me too many columns. So I am trying to add a condition to only keep the columns if they have 4 or more consecutive observations.

Data:

df <- structure(list(Time = c("7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:00 ", 
"7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", 
"11:00", "11:30", "12:00", "12:30", "1:00 ", "1:30 ", "2:00 ", 
"2:30 ", "3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", 
"6:00 ", "6:30 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", "11:00", 
"11:30", "12:00", "12:30", "1:00 ", "1:30 ", "2:00 ", "2:30 ", 
"3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", "6:00 ", 
"6:30 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", 
"10:00", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", 
"10:00", "10:30", "11:00", "11:30", "12:00", "12:30", "1:00 ", 
"1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", "4:30 ", 
"5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", "7:00 ", 
"7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", 
"11:00", "11:30", "12:00", "1:00 ", "1:30 ", "2:00 ", "2:30 ", 
"3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", "6:00 ", 
"6:30 ", "7:00 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:05 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "7:00 "), Date = structure(c(14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14615, 
14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 
14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 
14615, 14615, 14615, 14615, 14615, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14617), class = "Date"), Weather = c("Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Partly sunny", "Drizzle Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Drizzle Partly sunny", "Drizzle Partly sunny", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Fog", 
"Passing clouds", "Passing clouds", "Light fog", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Broken clouds", "Partly cloudy", "Partly cloudy", "Partly cloudy", 
"Partly cloudy", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Rain Partly sunny", "Rain Partly sunny", "Rain Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Drizzle Fog", "Drizzle Fog", "Drizzle Fog", 
"Drizzle Fog", "Drizzle Fog", "Drizzle Fog", "Drizzle Fog", "Fog", 
"Fog", "Fog", "Fog", "Light rain Fog", "Light rain Fog", "Rain Fog", 
"Rain Fog", "Rain Fog", "Rain Fog", "Rain Fog", "Rain Fog", "Fog", 
"Partly sunny", "Broken clouds", "Broken clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Light rain Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Fog", 
"Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Partly sunny", 
"Broken clouds", "Broken clouds", "Broken clouds", "Broken clouds", 
"Broken clouds", "Broken clouds", "Broken clouds", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Scattered clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Partly cloudy", 
"Broken clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Broken clouds", "Broken clouds", "Broken clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Passing clouds", "Passing clouds", 
"Rain Low clouds", "Rain Low clouds", "Rain Low clouds", "Rain Low clouds", 
"Light rain Mostly cloudy", "Light rain Mostly cloudy", "Light rain Mostly cloudy", 
"Light rain Mostly cloudy", "Rain Low clouds", "Light rain Mostly cloudy", 
"Light rain Mostly cloudy", "Rain Mostly cloudy", "Snow Mostly cloudy", 
"Snow Mostly cloudy", "Snow Ice fog", "Snow Ice fog", "Snow Ice fog", 
"Snow Ice fog", "Snow Ice fog", "Snow Ice fog", "Snow Ice fog", 
"Snow Ice fog", "Light snow Ice fog", "Light snow Ice fog", "Ice fog", 
"Passing clouds", "Partly cloudy", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -200L))
1

1 Answers

1
votes

Are you looking for something like this:

library(dplyr)df_new <- df %>% 
  group_by(Date) %>% 
  mutate(repeated = rep(rle(Weather)$lengths, rle(Weather)$lengths)) %>% 
  filter(repeated >= 4)

df_new
#> # A tibble: 148 x 4
#> # Groups:   Date [7]
#>    Time    Date       Weather        repeated
#>    <chr>   <date>     <chr>             <int>
#>  1 "7:00 " 2010-01-01 Passing clouds        4
#>  2 "7:30 " 2010-01-01 Passing clouds        4
#>  3 "8:00 " 2010-01-01 Passing clouds        4
#>  4 "8:30 " 2010-01-01 Passing clouds        4
#>  5 10:00   2010-01-01 Partly sunny         10
#>  6 10:30   2010-01-01 Partly sunny         10
#>  7 11:00   2010-01-01 Partly sunny         10
#>  8 11:30   2010-01-01 Partly sunny         10
#>  9 12:00   2010-01-01 Partly sunny         10
#> 10 12:30   2010-01-01 Partly sunny         10
#> # … with 138 more rows

df_new %>% 
  summarise(text = paste(unique(Weather), collapse = "_")) 
#> # A tibble: 7 x 2
#>   Date       text                                                          
#>   <date>     <chr>                                                         
#> 1 2010-01-01 Passing clouds_Partly sunny_Scattered clouds                  
#> 2 2010-01-02 Scattered clouds_Partly sunny_Partly cloudy_Passing clouds    
#> 3 2010-01-03 Passing clouds_Partly sunny                                   
#> 4 2010-01-04 Drizzle Fog_Fog_Rain Fog_Passing clouds                       
#> 5 2010-01-05 Fog_Broken clouds_Partly sunny                                
#> 6 2010-01-06 Scattered clouds                                              
#> 7 2010-01-07 Rain Low clouds_Light rain Mostly cloudy_Snow Ice fog_Passing…

Created on 2019-11-25 by the reprex package (v0.3.0)

rle counts repeated consecutive values. I'm wrapping it in rep to make it the correct length for the mutate call, but you could also run it on its own to get a grip of how it works (if you don't already know). Once you know how often each value is repeated, it's easy to first filter then summarise.