Clean character vector and strsplit into dataframe

Question

I have a character vector that I want to transform into a data frame. It's mostly clean, but I can't figure out how to finish the cleaning. The real data are a Date column in yyyy-mm-dd format and a Variable column holding a number (four digits in this case, but not always), separated by a comma.

class(myvec)
[1] "character"
myvec
[1] " \"2016-01-01,8631n\" " " \"2016-01-02,8577n\" "
[3] " \"2016-01-03,8476n\" " " \"2016-01-04,8365n\" "
[5] " \"2016-01-05,8331n\" " " \"2016-01-06,8801n\" "
[7] " \"2016-01-07,5020n\""

The leading space and escaped quote (' \"') should be removed, and likewise the trailing n\". The expected output should be a data frame like this:

    Date         Variable  
[1,] "2016-01-01" "8631"
[2,] "2016-01-02" "8577"
[3,] "2016-01-03" "8476"
[4,] "2016-01-04" "8365"
[5,] "2016-01-05" "8331"
[6,] "2016-01-06" "8801"
[7,] "2016-01-07" "5020"

Once the vector is clean, I think this does the job:

do.call(rbind,strsplit(clean_vector,","))

I think I can convert the date with lubridate and the variable to numeric with as.numeric on my own; the question is about getting the character vector clean and in the correct format.

gsub("[n \"]","",x) # "2016-01-01,8631" works fine for the first one. You could also just use substr since all your objects seem to be fixed-width. — Frank
@Frank Please post an answer, this is great! also maybe provide some explanation — Matias Andina

Frank · Accepted Answer · 2016-01-08T20:13:23

You can remove the offending characters by enumerating them:

# example
x = " \"2016-01-01,8631n\" "

gsub("[n \"]","",x)
# "2016-01-01,8631"

This works because [xyz] matches any single character from the list xyz, and gsub replaces every match, here with the empty string.
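Since gsub is vectorized, the same pattern can be applied to the whole vector at once and fed straight into the strsplit call from the question. Here is a minimal sketch, assuming a shortened copy of myvec (three regular elements plus the last one, which has no trailing space as printed); the names parts and result are only illustrative, and base as.Date stands in for lubridate:

```r
# Shortened copy of the question's vector (assumption: same layout as the real data)
myvec <- c(" \"2016-01-01,8631n\" ", " \"2016-01-02,8577n\" ",
           " \"2016-01-03,8476n\" ", " \"2016-01-07,5020n\"")

# Drop every "n", space, and double quote from each element
clean_vector <- gsub("[n \"]", "", myvec)

# Split on the comma and bind the pieces row-wise into a character matrix
parts <- do.call(rbind, strsplit(clean_vector, ","))

# Build the data frame, converting each column to its intended type
result <- data.frame(Date = as.Date(parts[, 1]),
                     Variable = as.numeric(parts[, 2]))
```

Note that the pattern removes those characters wherever they occur, not only at the ends. That is safe here because dates and digits contain none of them, but it would need tightening for data with letters or spaces in the payload.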

Or you can take a substring, since the formatting is fixed-width, with bad chars at the start and end:

substr(x,3,17)
# "2016-01-01,8631"

If the variable part of the string varies in length, nchar(x)-3 should work in place of 17, i.e. substr(x,3,nchar(x)-3).
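One caveat with the position-based approach: as printed in the question, the last element has no trailing space, so a fixed offset from the end would cut one character too many there. A rough sketch of one way around that, assuming the only irregularity is surrounding whitespace, is to trim first and then count from the quotes. The five-digit element below is made up to show the variable-width case, and trimws needs R 3.2.0 or later:

```r
# Illustrative input: a regular element, a hypothetical five-digit value,
# and an element without the trailing space (like the last one in the question)
x <- c(" \"2016-01-01,8631n\" ",
       " \"2016-01-08,12345n\" ",
       " \"2016-01-07,5020n\"")

# Strip leading and trailing whitespace so every element has the same edges
trimmed <- trimws(x)

# Each element is now: quote, payload, "n", quote.
# Keep everything from position 2 up to two characters before the end.
cleaned <- substr(trimmed, 2, nchar(trimmed) - 2)
```

substr is vectorized over its start and stop arguments, so nchar(trimmed) - 2 gives each element its own end position.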
