I have an unstructured text file that I need to extract some data from and put into a structured format. The data look like the sample below (each record spans more than one row):
21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text…. . 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4
22 March 2017 23:10:45 text 22 March 2017 23:10:45 More text…. . 23 March 2017 23:10:45 And more text 23 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4
The code below extracts everything after the word "Message" into separate columns (more text1, more text2, more text3, more text4). I would like to modify it to also include the date just before the word "Message". Here is the code I have:
#Read data
m <- readLines("C:/user...", skipNul = TRUE)
#remove special characters that might affect reading the data later:
m <- sapply(m, function(i) {
b <- gsub("\032"," ",i)
gsub("\t","",b)
})
#convert to one big character string
m <- paste(m, collapse="")
#since some entries span multiple lines, replace the date
#(which precedes each piece of information in the file) with a caret,
#then remove newline characters, then replace carets
#with newlines. At the end each entry will be on one line:
date_pattern <- "\\[[0-9]{2}\\-[A-Z]{1}[a-z]{2}\\-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
m <- gsub(date_pattern, "^", m)
m <- gsub("\n","",m)
m <- gsub("\\^", "\n", m)
#split back into one element per line, then only keep lines with the word "Message"
m <- strsplit(m, "\n", fixed = TRUE)[[1]]
m <- m[grep("Message", m)]
class(m) <- "character"
#remove the word "Message" and trim leading white space:
m <- sapply(strsplit(m,split = "Message", fixed=TRUE), function(i) (i[2]))
m <- trimws(m, which="left")
#write to file:
writeLines(m, "C:/user...")
The result of the above code is everything after the word "Message" (more text1, more text2, more text3, more text4), each in a separate column.
I need to modify the above code to add the date as well. Any suggestions? I was able to extract the date by itself and tried merging it with the data I extracted using cbind, but I got the day in one column, the month in a second column, and the year in a third column.
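To make the goal concrete, here is a rough sketch of the result I am after. It is only an illustration and rests on several assumptions: the records have already been collapsed to one string each, the dates look like the sample above ("21 March 2017 23:10:45") rather than the bracketed format in my date_pattern, and the names records, before, after, last_date and result are made up for this example.

```r
# Two sample records, already collapsed to one string per record (assumption).
records <- c(
  "21 March 2017 23:10:45 text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4",
  "22 March 2017 23:10:45 text 23 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4"
)

# Assumed date format, matching the sample data: "21 March 2017 23:10:45".
sample_date_pattern <- "[0-9]{1,2} [A-Z][a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# Only keep records that contain the word "Message".
records <- records[grepl("Message", records, fixed = TRUE)]

# Split each record into the part before and the part after the first "Message".
before <- sub("Message.*$", "", records)
after  <- trimws(sub("^.*?Message:?", "", records, perl = TRUE))

# Find every date in the part before "Message" and keep the last one per record.
all_dates <- regmatches(before, gregexpr(sample_date_pattern, before))
last_date <- vapply(
  all_dates,
  function(d) if (length(d) > 0) d[length(d)] else NA_character_,
  character(1)
)

# One row per record: the whole date as a single field, then the message text.
result <- data.frame(date = last_date, message = after, stringsAsFactors = FALSE)
print(result)
```

My guess is that the day, month and year ended up in separate columns because the output was split on whitespace, so writing the result with a different separator (for example write.table with sep = "\t") might keep the date together, but I have not confirmed this.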