
I have several hundred dataframes, each containing employee information such as age, sex, position, DOB, etc. The issue is that most files have different column headers and data structures.

I am trying to capture as much information from each file as is feasibly possible, i.e. not all of it, just the 70-80% that can be captured efficiently with a simple column match or algorithm.

This data will be merged into one "Master Census Table" containing common columns such as those listed above, which will then be uploaded to a SQL Server.

My first idea to accomplish this was to create a Master Column Key that maps several variations of a column header to one standardized header. For example, the "Sex" column would be matched when the input file's column header contained one of ("gender", "sex", "M/F", "Male/Female").

Any matched columns would get added to the master table, leaving blank cells in columns where no match was found.
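
To make that concrete, here is a rough sketch of the kind of key I have in mind (the headers and variations are just examples):

    # Master Column Key: each standardized header maps to the header
    # variations seen in the input files (example values only).
    master_key <- list(
      Sex      = c("gender", "sex", "m/f", "male/female"),
      DOB      = c("dob", "date of birth", "birthdate"),
      Position = c("position", "job title", "role")
    )

    # Return the standardized name for a raw header, or NA if nothing matches.
    standardize_header <- function(header, key = master_key) {
      hits <- vapply(key, function(v) tolower(trimws(header)) %in% v, logical(1))
      if (any(hits)) names(key)[which(hits)[1]] else NA_character_
    }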

What is the best approach to accomplish this? "If" statements in a loop over each column header that check for matches in the column translation key? Are there functions/packages better suited for this?

I am struggling to wrap my head around pulling whole columns into a new dataframe based on conditional logic without writing convoluted "if, then" loops, while also making sure the data structure is coerced and each column is standardized, i.e. the "Sex" column contains only "Male" and "Female", not a mix of "M", "F", "Boy", "Girl", etc.
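
For the value side, I was picturing a lookup table per column; a minimal sketch for "Sex" (the observed codes are just examples):

    # Map observed codes to standardized values; anything not in the
    # lookup becomes NA, which can be reviewed later.
    sex_lookup <- c(m = "Male", male = "Male", boy = "Male",
                    f = "Female", female = "Female", girl = "Female")

    standardize_sex <- function(x) unname(sex_lookup[tolower(trimws(x))])

    standardize_sex(c("M", "Female", "Boy", "unknown"))
    #> [1] "Male"   "Female" "Male"   NA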

Any help/input/discussion is appreciated.

I can take some time to add sample data if anyone would like a better visualization of what I mean.

R Version 3.4.4

I'd recommend using some of the dplyr::select functionality with the matches argument to specify a vector of patterns to match for each of the fields you're trying to capture. As you do that, normalize your column names and values. Assuming some files have more columns than others, you need to decide which ones are important to keep. For files that are missing those values, you can populate those rows with NA values (see the sketch after these comments). – Mako212

Is all the info at the employee level? Is the employeeID the same in every table? If not, do these tables have unique keys? – Moody_Mudskipper

It would help if you gave us 3 simplified tables of 3 rows, containing the types of issues you're facing, and the output you would expect. – Moody_Mudskipper
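
A minimal sketch of the dplyr::select plus matches() approach from the first comment (the input data frame and the patterns here are hypothetical):

    library(dplyr)

    # Toy input with non-standard headers.
    df <- data.frame("M/F" = c("M", "F"),
                     Birthdate = c("1980-01-01", "1990-02-02"),
                     JobTitle = c("Analyst", "Manager"),
                     check.names = FALSE)

    # matches() is case-insensitive by default; renaming inside select()
    # standardizes the header in the same step.
    df %>% select(Sex      = matches("gender|^sex$|m/f"),
                  DOB      = matches("dob|birth"),
                  Position = matches("position|title|role"))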

1 Answer


It is difficult to answer this without the data, but I had to do something similar in the past, and here is how I did it.

  • Load all data frames and collect every variable (header) name
  • Check the structure (data types) of the variables in each data frame
  • Store these data in a list, one entry per data frame, containing the headers and the data types

This should yield a list of the headers and data types for all frames.
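
This survey step can be scripted; a rough sketch, assuming the files are CSVs sitting in one folder (the path and read settings are assumptions):

    files <- list.files("census_files", pattern = "\\.csv$", full.names = TRUE)

    # Peek at each file and record its headers and column classes.
    survey <- lapply(files, function(f) {
      df <- read.csv(f, stringsAsFactors = FALSE, nrows = 100)
      list(file = f, headers = names(df), types = sapply(df, class))
    })

    # Every distinct header seen across the files, for visual scanning.
    sort(unique(unlist(lapply(survey, `[[`, "headers"))))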

  • After visually scanning these data, define the structure of the resulting data frame you want (this is your "Master Census Table")

At this point you can:

  • Load each data frame
  • Check the header
  • Correct variables to the appropriate format, drop unnecessary variables, and add necessary ones (you can use rep(NA, each = nrow(df)), or some other value) for missing values (see the sketch after this list)
  • rbind() the results
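
Put together, the per-file step might look something like this rough sketch (master_cols and the files vector are assumptions; standardize_header() is the helper sketched in the question):

    master_cols <- c("EmployeeID", "Sex", "DOB", "Position", "Age")

    process_file <- function(f) {
      df <- read.csv(f, stringsAsFactors = FALSE)
      # Map raw headers to standardized ones; unmatched headers become NA.
      names(df) <- vapply(names(df), standardize_header, character(1))
      df <- df[, !is.na(names(df)), drop = FALSE]    # drop unmatched columns
      for (col in setdiff(master_cols, names(df))) {
        df[[col]] <- rep(NA, nrow(df))               # add missing columns as NA
      }
      df[master_cols]                                # enforce the master structure
    }

    master <- do.call(rbind, lapply(files, process_file))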

If you have more details, then I can answer in greater detail, but mostly what you need is a strategy, and then to write an algorithm (which may, indeed, have many ifelse() calls) to apply it.