How to modify this code to include date in a new column?

Question

I have an unstructured text file that I need to extract some data from and put into a structured format. The data look as below (each record spans more than one row):

21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text…. . 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4

22 March 2017 23:10:45 text 22 March 2017 23:10:45 More text…. . 23 March 2017 23:10:45 And more text 23 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4

The code below extracts everything after the word "Message" into separate columns (more text1, more text2, more text3, more text4). I would like to modify it to also include the date just before the word "Message". Here is the code I have:

#Read data
m <- readLines("C:/user...", skipNul = TRUE)

#remove special characters that might affect reading the data later:
m <- sapply(m, function(i) {
b <- gsub("\032"," ",i)
gsub("\t","",b)
})

#convert to one big character string
m <- paste(m, collapse="")

#since some entries span multiple lines, replace the date
#(which is prepended to each piece of information in the file) with a caret,
#then replace newline characters with blanks, then replace carets
#with new lines. At the end, each entry will be on one line:

date_pattern <- "\\[[0-9]{2}\\-[A-Z]{1}[a-z]{2}\\-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

m <- gsub(date_pattern, "^", m)
m <- gsub("\n","",m)
m <- gsub("\\^", "\n", m)


#only keep lines with the word "Message"
m <- m[grep("Message", m)]
class(m) <- "character"
#remove the word "Message" and trim leading white space:
m <- sapply(strsplit(m,split = "Message", fixed=TRUE), function(i) (i[2]))
m <- trimws(m, which="left")

#write to file:
writeLines(m, "C:/user...")

The result of the above code is everything after the word "Message" (more text1, more text2, more text3, more text4), each in a separate column.

I need to modify the above code to add the date as well. Any suggestions? I was able to extract the date by itself and tried merging it with the data I extracted using cbind, but I got the day in one column, the month in a second column, and the year in a third column.

ekstroem · Accepted Answer · 2017-03-25T22:47:09

Here are some Perl-style regex tricks making use of greedy matching that might help you out.

First, get some data to test on:

x <- "21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text. 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4"

Then define the date pattern (slightly different from your pattern above; note that months are written out in full):

date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
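As a quick sanity check, the pattern can be tried on its own before it is used in a substitution. This is only a sketch on a shortened version of the test string; `gregexpr` and `regmatches` are base R, and the variable name `dates` is just for illustration:

```r
# Shortened test string (an assumption for illustration, not the full record)
x <- "21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text. Message: more text1"
date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# gregexpr locates every match; regmatches returns the matched substrings
dates <- regmatches(x, gregexpr(date_pattern, x))[[1]]
dates
```

If `dates` comes back empty on your real file, the pattern probably needs adjusting to the actual date format (for example abbreviated months or a leading bracket).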

Use gsub and back references to get what you want:

gsub(paste0("(.*)(", date_pattern , ")(.*)Message: (.*)"), "\\2  \\4", x)

which yields

"21 March 2017 23:10:45  more text1 more text2 more text3 more text4"

You can insert a delimiter in the replacement string passed to gsub in case you want to split the output up further.
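For instance, here is a minimal sketch that puts a tab between the date and the message and then splits on it, applied to both sample records from the question. The tab delimiter and the names `records`, `out`, `parts` and `result` are assumptions for illustration; any character that does not occur in your data would do:

```r
# The two sample records from the question ("More text…. ." shortened to "More text.")
records <- c(
  "21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text. 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4",
  "22 March 2017 23:10:45 text 22 March 2017 23:10:45 More text. 23 March 2017 23:10:45 And more text 23 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4"
)
date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# The leading (.*) is greedy, so group 2 is the last date before "Message: ".
# A tab (assumed not to occur in the data) separates date and message.
out <- gsub(paste0("(.*)(", date_pattern, ")(.*)Message: (.*)"), "\\2\t\\4", records)

# Split each result on the tab and bind the pieces into two columns
parts <- do.call(rbind, strsplit(out, "\t", fixed = TRUE))
result <- data.frame(date = parts[, 1], message = parts[, 2], stringsAsFactors = FALSE)
result
```

Because the date is kept as one tab-delimited field, writing `out` with writeLines and reading it back with a tab separator should keep the whole date in a single column instead of splitting it into day, month and year.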
