
I have an unstructured text file that I need to extract some data from and put into a structured format. The data look like this (each record spans more than one row):

21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text…. . 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4

22 March 2017 23:10:45 text 22 March 2017 23:10:45 More text…. . 23 March 2017 23:10:45 And more text 23 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4

The code below extracts everything after the word "Message" into separate columns (more text1, more text2, more text3, more text4). I would like to modify it to include the date just before the word "Message". Here is the code I have:

#Read data
m <- readLines("C:/user...", skipNul = TRUE)

#remove special characters that might affect reading the data later:
m <- sapply(m, function(i) {
  b <- gsub("\032", " ", i)
  gsub("\t", "", b)
})

#convert to one big character string
m <- paste(m, collapse="")

#since some entries span multiple lines, replace the date
#(which precedes each piece of information in the file) with a caret,
#then replace newline characters with blanks, then replace carets
#with newlines. At the end, each entry will be on one line:

date_pattern <- "\\[[0-9]{2}\\-[A-Z]{1}[a-z]{2}\\-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

m <- gsub(date_pattern, "^", m)
m <- gsub("\n","",m)
m <- gsub("\\^", "\n", m)


#split the string into one entry per line, then
#only keep lines with the word "Message"
m <- unlist(strsplit(m, "\n", fixed = TRUE))
m <- m[grep("Message", m)]
class(m) <- "character"
#remove the word "Message" and trim leading white space:
m <- sapply(strsplit(m,split = "Message", fixed=TRUE), function(i) (i[2]))
m <- trimws(m, which="left")

#write to file:
writeLines(m, "C:/user...")

The result of the above code is everything after the word "Message" (more text1, more text2, more text3, more text4) each in a separate column.

I need to modify the above code to add the date as well. Any suggestions? I was able to extract the date by itself and tried merging it with the data I extracted using cbind, but I got the day in one column, the month in a second column, and the year in a third column.


1 Answer


Here are some Perl-style regex tricks, making use of greedy matching, that might help you out.

First, get some data to test on:

x <- "21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text. 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4"

Then define the date pattern (slightly different from your pattern above; note that the months are written out in full):

date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
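
As a quick sanity check before building the larger expression, it may help to confirm that this pattern finds every timestamp in the test string. The sketch below uses base R's gregexpr and regmatches; the variable name dates is just an illustrative choice, and the test string is the one defined in this answer.

```r
# Sample record and date pattern copied from this answer.
x <- "21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text. 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4"
date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# gregexpr() locates every match; regmatches() extracts the matched text.
dates <- regmatches(x, gregexpr(date_pattern, x))[[1]]

length(dates)   # 4
unique(dates)   # "21 March 2017 23:10:45"
```

If the count is lower than expected on real data, the pattern probably needs adjusting (for example for single-digit days) before it is used inside gsub.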

Use gsub and back references to get what you want:

gsub(paste0("(.*)(", date_pattern , ")(.*)Message: (.*)"), "\\2  \\4", x)

which yields

"21 March 2017 23:10:45  more text1 more text2 more text3 more text4"
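
Because every date in that test string is identical, the output alone does not show which date was captured. A minimal sketch on the second record from the question (with the "More text" part simplified, and y and res as illustrative names) suggests that the greedy leading (.*) pushes the date group as far right as possible, so the date closest to "Message:" is the one kept:

```r
date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# Second sample record from the question: it starts on 22 March,
# but the timestamps just before "Message:" are from 23 March.
y <- "22 March 2017 23:10:45 text 22 March 2017 23:10:45 More text. 23 March 2017 23:10:45 And more text 23 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4"

# The first (.*) is greedy, so group 2 ends up on the last date
# that is still followed by "Message: ".
res <- gsub(paste0("(.*)(", date_pattern, ")(.*)Message: (.*)"), "\\2  \\4", y)
res
# "23 March 2017 23:10:45  more text1 more text2 more text3 more text4"
```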

You can insert a delimiter into the gsub replacement string if you want to split the result into separate fields afterwards.
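
One way this could look is sketched below, assuming "|" never occurs in the data (any other unused character would do). The names out, parts and df are illustrative. Keeping the whole timestamp as a single string should also avoid the day, month and year landing in separate columns.

```r
x <- "21 March 2017 23:10:45 text 21 March 2017 23:10:45 More text. 21 March 2017 23:10:45 And more text 21 March 2017 23:10:45 some more text Message: more text1 more text2 more text3 more text4"
date_pattern <- "[0-9]{2} [A-Z]{1}[a-z]+ [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"

# Put a "|" between the date and the message text.
# Assumption: "|" does not appear anywhere in the real data.
out <- gsub(paste0("(.*)(", date_pattern, ")(.*)Message: (.*)"), "\\2|\\4", x)

# Split on the literal delimiter (fixed = TRUE, since "|" is a regex metacharacter).
parts <- strsplit(out, "|", fixed = TRUE)[[1]]

# One column for the full timestamp, one for the message text.
df <- data.frame(date = parts[1], message = parts[2], stringsAsFactors = FALSE)
df
```

With several records in a character vector, the same gsub call works element-wise, and the pieces from strsplit could then be combined row by row, for example with do.call(rbind, ...).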