0
votes

I have a character vector that I want to transform into a data frame. It's mostly clean, but I can't figure out how to finish the cleaning. The real data in each element are a Date as yyyy-mm-dd and a Variable as a number (four digits in this case, but not always), separated by a comma.

class(myvec)
[1] "character"
myvec
[1] " \"2016-01-01,8631n\" " " \"2016-01-02,8577n\" "
[3] " \"2016-01-03,8476n\" " " \"2016-01-04,8365n\" "
[5] " \"2016-01-05,8331n\" " " \"2016-01-06,8801n\" "
[7] " \"2016-01-07,5020n\"" 

The leading space and backslash-quote (' \"') should be removed, along with the trailing n\". The expected output should be a data frame like this:

    Date         Variable  
[1,] "2016-01-01" "8631"
[2,] "2016-01-02" "8577"
[3,] "2016-01-03" "8476"
[4,] "2016-01-04" "8365"
[5,] "2016-01-05" "8331"
[6,] "2016-01-06" "8801"
[7,] "2016-01-07" "5020"

Once the vector is clean, I think this does the job:

do.call(rbind,strsplit(clean_vector,","))

I think I can convert the dates with lubridate and the variable with as.numeric on my own. The question is about getting the character vector clean and in the correct format.

1
gsub("[n \"]","",x) # "2016-01-01,8631" works fine for the first one. You could also just use substr since all your objects seem to be fixed-width. - Frank
@Frank Please post an answer, this is great! also maybe provide some explanation - Matias Andina

1 Answer

3
votes

You can remove the offending characters by enumerating them:

# example
x = " \"2016-01-01,8631n\" "

gsub("[n \"]","",x)
# "2016-01-01,8631"

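This works because [xyz] matches any single character from the set xyz, and gsub replaces every match in the string (here with the empty string). It is safe for this data because the parts you want to keep contain only digits, hyphens and a comma.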
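Since gsub is vectorized, the same call can be applied to the whole vector and fed into the strsplit/rbind step from the question. The following is a minimal sketch rather than a definitive solution: it assumes a shortened myvec rebuilt from the printed output in the question, and the names parts and df are only illustrative.

```r
# Shortened sample rebuilt from the question's printed output
# (assumption: the last element really has no trailing space)
myvec <- c(" \"2016-01-01,8631n\" ",
           " \"2016-01-02,8577n\" ",
           " \"2016-01-07,5020n\"")

# Remove every space, double quote and letter n from each element
clean_vector <- gsub("[n \"]", "", myvec)

# Split each element on the comma and stack the pieces into a 2-column matrix
parts <- do.call(rbind, strsplit(clean_vector, ","))

# Build the data frame, converting each column to its intended type
df <- data.frame(
  Date     = as.Date(parts[, 1]),
  Variable = as.numeric(parts[, 2])
)
```

Base as.Date is enough here because the dates are already in yyyy-mm-dd form; lubridate::ymd should work equally well if you prefer it.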


Or you can take a substring, since the formatting is fixed-width, with bad chars at the start and end:

substr(x,3,17)
# "2016-01-01,8631"

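If the var part of the string varies in length, nchar(x)-3 should work in place of 17, provided every element ends with the same three unwanted characters (n, a double quote and a space).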
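In the question's printed output the last element appears to lack the trailing space, so a fixed offset from the end would cut off its final digit. One possible way around this, sketched here under that assumption, is to trim whitespace first so that every element starts with one quote and ends with n plus one quote. The five-digit value below is made up purely to illustrate a variable-width number.

```r
# Illustrative input: a normal element, a hypothetical five-digit value,
# and an element without the trailing space
x <- c(" \"2016-01-01,8631n\" ",
       " \"2016-01-06,88010n\" ",
       " \"2016-01-07,5020n\"")

# Drop leading and trailing spaces so the offsets are the same for every element
trimmed <- trimws(x)

# Skip the opening quote (position 1) and the closing n plus quote (last 2 characters)
cleaned <- substr(trimmed, 2, nchar(trimmed) - 2)
```

trimws is part of base R (3.2.0 and later), and substr accepts vector start and stop arguments, so no loop is needed.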