2
votes

I am trying to end up with a time series plot comparing different cities with a center data (data frame).Where center is a dataframe object in R studio, I already imported.

I have a folder with 165 csv files, each representing a city. I want to plot all the 165 csv files (as independent names/data frame) in one plot plus the center data frame.

I want it to look something like this: (with the x-axis being time and the y-axis being CO with all being solid colors.

enter image description here

There are four things I want to do to each csv file, but in the end, have it automated that these four actions are done to each of the 165 csv files.

1) Skip the first 25 row of the csv file

2) Combine the Date and Time column for each csv file

3) Remove the rows where the values in the cells in column 3 is empty

4) Change the name of column 3 from ug/m3 to CO

I want it to perform the four actions on each of the 165 csv files in an automated way.Then, be able to efficiently plot the newly updated csv files in one plot.

I used the code below on one csv file to see if it would work on one csv. I am not sure how to combine everything in an efficient manner.achieve this:

city1 <- read.csv("path",
                        skip = 25)

city1$rtime <- strptime(paste(city1$Date, city1$Time), "%m/%d/%Y %H:%M")

colnames(city1)[3] <- "CO"

city[,3][!(is.na(city[,3]))] ## side note: help with this would be appreciated, I was unsure of what goes before the comma.

Overall, I want a plot like above with all the 165 cities (csv files). I need help placing the four actions on each csv file and plot them all in one plot.

For the plot, I did this as an example:

ggplot(center, aes(rtime, CO)) + geom_smooth(aes(color="Center"))+
  geom_smooth(data=city1,aes(color="City1"))+
  labs(color="Legend")

UPDATE: The csv file of each city seemed to have combined to create one line.I am not sure if I can post the exact output but it looked like the one below: with the pink line being cities and blue being center.x-axis time and y-axis being CO.I hope this helps.

Result of unique(df.cleaned$cities)

> unique(df.cleaned$cities)
 [1] "WFH4N_YEK04_PORTLAND_08AUG16_R1"
 [2] "WFH2N_QIM23_AUSTIN_30JUL16_R1" 
 [3] "WFH7N_QIM70_NEWYORK_20JUL16_R1"
 [4] "WFH3N_YEK28_NAMPA_23AUG16_R1"
 [5] "WFH9N_YEK18_MESA_12JUL16_R1"
 [6] "WFH6N_QIM10_OAKLAND_11AUG16_R1"
 [7] "WFH3N_YEK01_DETROIT_30AUG16_R1"
 [8] "WFH6N_YEK05_ATLANTA_30AUG16_R1"
 [9] "WFH1N_YEK32_LONGBEACH_01JUL16_R1"
[10] "WFH8N_YEK39_LOSANGELES_30AUG16_R1"
[11] "WFH5N_YEK59_BALTIMORE_31AUG16_R1"
[12] "WFH1N_QIM19_MEMPHIS_01JUL16_R1"
[13] "WFH0N_YEK2087_DENVER_09JUL16_R1"
[14] "WFH4N_QIM43_CLEVELAND_30AUG16_R1"
[15] "WFH8N_QIM65_HARTFORD_30AUG16_R1"
[16] "WFH2N_YEK66_SEATTLE_30AUG16_R1"
[17] "WFH0N_YEK17_SANJOSE_30AUG16_R1"

enter image description here

1
To automate the part of reading 165 csv files, you can get a vector of characters for all the csv file names, and then do lapply over the filenames. For example file_names <- list.files(path ="your folder path", pattern = ".csv") to get the filenames, and then lapply(file_names, FUN = function(file){...}) - shaojl7
With the regex I posted, all of those will parse to "", since your city names are capitalized, not title case like the example you gave. If you don't try to extract the city name, but instead leave that column cities like that, your plot should have separate lines for each one, as given by aes(colour = cities). Does that part work correctly? - Brian
@Brian I decided not to try the regex when I realized the example I had given was different.Without the adding regex, it still gave me one line. - Mah
I'm afraid I can't reproduce the error without some sample data then. Could you share 1-2 of the CSVs or are they too large? - Brian

1 Answers

1
votes

This is a fully tidyverse solution, so base R traditionalists, look away now.

library(dplyr)
library(purrr)
library(tidyr)
library(readr)
library(stringr)
library(lubridate)
# or just library(tidyverse)

df <-
  data_frame(files = list.files(path = "./yourfilepath",    
                                pattern = "csv", 
                                full.names = T)) %>%
    mutate(dfs = map(files, read_csv, skip = 25)) %>%       
    unnest() %>% 
    mutate(cities = str_replace_all(files, "./yourfilepath/", ""),
           cities = str_replace_all(cities, ".csv", ""))
  • This chunk first makes a list of all the files in the folder yourfilepath named *.csv.
  • Then you use purrr::map to run the same function on each element of the files variable: read_csv, skipping the first 25 lines.
  • Now you have a dataframe with 165 nested dataframes in the column dfs.
  • unnest() expands these to full dataframes, repeating the files rows as needed.
  • Then you can strip out the extra parts of the path from files to get your city names, assuming that the filename of the .CSV is the name of the city.
    • If you have the name of the city already in a column inside each .CSV, then you don't need the mutate(cities = ...) lines.

Now it's easy peasy to do the rest:

df.cleaned <- df %>%
  rename(CO = `ug/m3`) %>%
  filter( !is.na(CO)) %>%
  mutate(rtime = paste(Date, Time),
         rtime = mdy_hm(rtime))

And now use your plotting code, slightly updated:

ggplot(center, aes(rtime, CO)) + 
  geom_smooth(aes(color="Center"))+
  geom_smooth(data = df.cleaned, aes(color = cities))+
  labs(color="Legend")

Update with more regex

df.cleaned <- df.cleaned %>%
  group_by(files) %>%
  mutate(cities = str_c(str_extract_all(cities, 
                                  "[A-Z][a-z]+", 
                                  simplify = T), 
                        collapse = " "))