1
votes

I have a folder containing 630 excel files, all with similar file names. Each file represents climate data in specific geographic areas for a month of a specific year. My goal is to find a way to iterate my importing of these files and find the average of values for specific variables. All files are titled as such:

PRISM_ppt_stable_4kmM3_201201_bil

where "ppt" represents climate variable the data is about, "2012" represents the year 2012 and "01" represents the month of January. The next file in the folder is titled:

PRISM_ppt_stable_4kmM3_201202_bil

where "ppt" represents the same variable,"2012" again represents the year 2012 and "02" this time represents the month of February. These repeat for every month of every year and for 7 different variables. The variables are titled:

ppt, vpdmax, vpdmin, tmax, tmin, tdmean, tmean

Each excel file contains >1500 observations of 11 variables where I am interesting in finding the average MEAN variable among all matching tl_2016_us variables. Some quick sample data is shown below:

tl_2016_us MEAN
14136      135.808
14158      132.435
etc.       etc.

It gets tricky in that I only wish to find my averages over a designated winter season, in this case November through March. So all files with 201211, 201212, 201301, 201302 and 201303 in the file name should be matched by tl_2016_us and the corresponding MEAN variables averaged. Ideally, this process would repeat to the next year of 201311, 201312, 201401, 201402, 201403. To this point, I have used

list.files(path = "filepath", pattern ="*ppt*")

to create lists of my filenames for each of the 7 variables.

2

2 Answers

1
votes

I don't really get what the "tl_2016_us" variables are/mean.

However, you can easily get the list of only winter months using a bit of regular expressions like so:

library(tidyverse) 

# Assuming your files are already in your working directory
all_files <- list.files(full.names = TRUE, pattern = "*ppt*")
winter_mos <- str_subset(files, "[01, 02, 03, 11, 12]_\\w{3}$")

After that, you can iterate reading in all files into a data frame with map() from purrr:


library(readxl)

data <- map(winter_mos, ~ read_xlsx(.x)) %>% bind_rows(.id = "id")

After that, you should be able to select the variables you want, use group_by() to group by id (i.e. id of each Excel file), and then summarize_all(mean)

0
votes

Maybe something like (not very elegant):

filetypes = c("ppt", "vpdmax", "vpdmin", "tmax", "tmin", "tdmean", "tmean")
data_years = c(2012,2013,2014)

df <- NULL

for (i in 1:length(data_years)) {
  yr <- data_years[i]
  datecodes <- c(paste(yr,"11",sep=""), 
                 paste(yr,"12",sep=""),                 
                 paste(yr+1,"01",sep=""),
                 paste(yr+1,"02",sep=""),
                 paste(yr+1,"03",sep=""))
  for (j in 1:length(filetypes)) {
     filetype <- filetypes[j]
     file_prefix <- paste("PRISM",filetype,"stable_4kmM3",sep="_")

     for (k in 1:length(datecodes)) {           
        datecode <- datecodes[k]
        filename <- paste(file_prefix,datecode,"bil",sep="_")
        dk <- read_excel(filename)
        M <- dim(dk)[1]
        dk$RefYr <- rep(yr,M)
        dk$DataType <- rep(filetype,M)
        if (is.null(df_new)) {
          df <- dk
        } else {
          df <- rbind(df,dk) 
        }
     }
  }
}

Once that has run, you will have a data frame containing all the data you need to compute your averages (I think).

You could then do something like:

df_new <- NULL

for (i in 1:length(data_years)) {
  yr <- data_years[i]
  di <- df[df$RefYr==yr,]
  for (j in 1:length(filetypes)) {
     filetype <- filetypes[j]
     dj <- di[di$DataType==filetype,]
     tls <- unique(dj$tl_2016_us)
     for (k in 1:length(tls)) {
       tl <- tls[k] 
       dk <- dj[dj$tl_2016_us==tl,]
       dijk <- data.frame(RefYr=yr,TL2016=tl,DataType=filetype,
                          SeasonAverage=mean(dk$MEAN))
       if (is.null(df)){
         df_new <- dijk
       } else {
         df_new <- rbind(df_new,dijk) 
       }
     }
  }
}

I'm sure there are more elegant ways to do it and that there are some bugs in the above since I couldn't really run the code, but I think you should be left with a data frame containing what you are looking for.