I have several hundred dataframes, each containing employee information such as age, sex, position, DOB, etc. The issue is that most files have different column headers and data structures.
I am trying to capture as much information as is feasibly possible from each file, i.e. not all of it, just the 70-80% or so that can be captured efficiently with a simple column match or algorithm.
This data will be merged into one "Master Census Table" containing common columns such as those listed above, which will then be uploaded to SQL Server.
My first idea to accomplish this was to create a master column key that maps several variations of one column header to a standardized header. For example, the "Sex" column would be matched when the input file's column header was one of ("gender", "sex", "M/F", "Male/Female").
Any matched columns would get added to the master table, leaving blank cells in columns where no match was found.
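For illustration, here is a rough sketch of the kind of translation key and header-matching step I have in mind (the key contents, variations, and function name are made-up examples, not code I already have working):

```r
# Hypothetical master column key: each standardized header maps to the
# header variations I expect to see in the raw files (all lower-case).
col_key <- list(
  Sex      = c("gender", "sex", "m/f", "male/female"),
  DOB      = c("dob", "date of birth", "birthdate"),
  Position = c("position", "job title", "title", "role")
)

# Rename any recognized headers to their standardized names; unrecognized
# columns are left alone and can be dropped later.
standardize_headers <- function(df, key) {
  nm <- tolower(trimws(names(df)))
  for (std in names(key)) {
    hit <- which(nm %in% key[[std]])
    if (length(hit) >= 1) names(df)[hit[1]] <- std
  }
  df
}
```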
What is the best approach to accomplish this? "If" statements in a loop over each column header that check for matches in the column translation key? Are there functions/packages better suited to this?
I am struggling to wrap my head around pulling whole columns into a new dataframe based on conditional logic without writing convoluted "if, then" loops, while also making sure data types are coerced correctly and each column's values are standardized, e.g. the "Sex" column contains only "Male" and "Female", not a mix of "M", "F", "Boy", "Girl", etc.
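To make the value-standardization part concrete, here is a rough sketch of what I mean for the "Sex" column (the mapping is just an example; real files would need more variants added):

```r
# Map raw values to standardized labels; anything unrecognized becomes NA
# instead of being guessed.
sex_map <- c("m" = "Male", "male" = "Male", "boy" = "Male",
             "f" = "Female", "female" = "Female", "girl" = "Female")

standardize_sex <- function(x) {
  unname(sex_map[tolower(trimws(as.character(x)))])
}

standardize_sex(c("M", "Female", "boy", "unknown"))
# [1] "Male"   "Female" "Male"   NA
```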
Any help/input/discussion is appreciated.
I can take some time to add sample data if anyone would like a better picture of what I mean.
R Version 3.4.4
You can use dplyr::select functionality with the matches() helper to specify a vector of patterns to match for each of the fields you're trying to capture. As you do that, normalize your column names and values. Assuming some files have more columns than others, you need to decide which ones are important to keep. For files that are missing those values, you can populate those rows with NA values. – Mako212
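Following up on that suggestion, a minimal sketch of how dplyr::select() with matches() and bind_rows() could be combined (the patterns and the list_of_dfs object are assumptions for illustration):

```r
library(dplyr)

# One regex per standardized field; alternation keeps each call to matches()
# down to a single pattern (patterns here are examples only).
patterns <- c(
  Sex      = "gender|sex|m.?f",
  DOB      = "dob|birth",
  Position = "position|title|role"
)

# Pull the first column matching each pattern; fields with no match become NA.
extract_fields <- function(df, patterns) {
  out <- lapply(names(patterns), function(std) {
    sel <- select(df, matches(patterns[[std]], ignore.case = TRUE))
    if (ncol(sel) >= 1) as.character(sel[[1]]) else rep(NA_character_, nrow(df))
  })
  names(out) <- names(patterns)
  as.data.frame(out, stringsAsFactors = FALSE)
}

# bind_rows() stacks the per-file results and fills missing columns with NA;
# list_of_dfs is an assumed list holding the several hundred dataframes.
master <- bind_rows(lapply(list_of_dfs, extract_fields, patterns = patterns))
```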