Breaking up column of strings into multiple columns (no standard format)?

Question

I have an excel file which I have imported into R. The problem is that there are over 100,000 rows that have data that looks like the following in a single cell:

{'gender': 'Male', 'nationality': 'IRL', 'document_type': 'passport', 'date_of_expiry': '2019-08-12', 'issuing_country': 'IRL'}

So what happens is that R also places this entire information into one box in my table. I would like R to break this single column into multiple columns such that 'gender', 'nationality', 'document_type' etc. each have their own column with entries 'Male', 'IRL', 'passport' etc. respectively for this row. The problem is that some information is missed out in other cells e.g. another cell may look like this:

{'gender': 'Female', 'document_type': 'driving_licence', 'date_of_expiry': '2023-02-28', 'issuing_country': 'GBR'}

and another like this:

{'document_type': 'driving_licence', 'date_of_expiry': '2044-12-14', 'issuing_country': 'GRC'}

How do I assign these entries to their own columns and have empty values where no information is provided. I am a beginner R user. I have tried using package "qdapRegex" because they have a handy tool where I can extract characters between two other characters/strings but to no avail. I think if I could figure out how to apply operators like '|' to strings I would be able to do this with my way.

After Input:

input <- read.csv(text=r"(
ID, Properties
1,"{'gender': 'Male', 'nationality': 'IRL', 'document_type': 'passport', 'date_of_expiry': '2019-08-12', 'issuing_country': 'IRL'}"
2,"{'document_type': 'national_identity_card', 'date_of_expiry': '2027-01-23', 'issuing_country': 'CYP'}"
3,"{'date_of_expiry': '2019-01-28', 'issuing_country': 'JEY'}"
4,"{'document_type': 'driving_licence', 'date_of_expiry': '2006-06-14', 'issuing_country': 'POL'})")

Desired Output:

output <- read.csv(text="
ID,gender,nationality,document_type,date_of_expiry,issuing_country
1,Male,IRL,passport,2019-08-12,IRL
2,,,national_identity_card,2027-01-23,CYP
3,,,,2019-01-28,JEY
4,,,driving_licence,2006-06-14,POL
")

It looks like you have a column of JSON data. There are JSON parsers in R that can help. It's easier to help you if you include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. — MrFlick
That was much better, I turned the data into proper R objects. All those pipes and dashes still make it hard to copy/paste into R for testing. We're trying to prevent having to retype of bunch of stuff just to test the code. — MrFlick

MrFlick MrFlick · Accepted Answer · 2021-07-16T19:29:45

So your data is very JSON like. The only "problem" is that it uses single quotes rather than double quotes. If we swap the quotes, we can more easily parse the data. Here's some code using a few helper pacakges to get the job done

library(purrr)
library(dplyr)
library(jsonlite)
library(tidyr)

input %>% 
  mutate(Properties = gsub("'", "\"", Properties)) %>% 
  mutate(Properties = map(Properties, fromJSON)) %>% 
  unnest_wider(Properties)

#      ID gender nationality document_type          date_of_expiry issuing_country
#   <int> <chr>  <chr>       <chr>                  <chr>          <chr>          
# 1     1 Male   IRL         passport               2019-08-12     IRL            
# 2     2 NA     NA          national_identity_card 2027-01-23     CYP            
# 3     3 NA     NA          NA                     2019-01-28     JEY            
# 4     4 NA     NA          driving_licence        2006-06-14     POL

When we parse the JOSN we get a named list of values. The tidyr::unnest_wider function will turn that list of values into columns.

Breaking up column of strings into multiple columns (no standard format)?

1 Answers