
I have a log file with at most about 1,200 characters per line. I want to read it and then extract certain portions of the file into new columns. Specifically, I want to extract the rows that contain the text "[DF_API: input string]". When I read the file and then filter for the rows I am interested in, I seem to be losing data. I tried this with dplyr's filter() and with standard grep(), with the same result.

Not sure why this is the case. I would appreciate your help with this. The code and the data are at the link below. Satish

The code is given below:

library(dplyr)
library(stringr)   # needed for str_detect()
setwd("C:/Users/satis/Documents/VF/df_issue_dec01")

sec1 <- read.delim(file = "secondary1_aa_small.log")
head(sec1)
names(sec1) <- c("V1")

# filter with dplyr + stringr
sec1_test <- filter(sec1, str_detect(V1, "DF_API: input string"))
head(sec1_test)

# same filter with base grep()
sec1_test2 <- sec1[grep("DF_API: input string", sec1$V1, perl = TRUE), ]
head(sec1_test2)

write.csv(sec1_test, file = "test_out.txt", row.names = FALSE, quote = FALSE)
write.csv(sec1_test2, file = "test2_out.txt", row.names = FALSE, quote = FALSE)

The data (and code) are at the link below. Sorry, I should have used dput.

https://spaces.hightail.com/space/arJlYkgIev


1 Answer


Try the code below, which gives you a data frame of the lines from your file that match the condition.

# read the file line by line
sec1 <- readLines("secondary1_aa_small.log")
# build a data frame from only the matching lines
new_sec1 <- data.frame(grep("DF_API: input string", sec1, value = TRUE),
                       stringsAsFactors = FALSE)
names(new_sec1) <- c("V1")
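
readLines() reads every physical line verbatim, with no header, quoting, or comment handling. read.delim(), by contrast, defaults to header = TRUE (so the first line becomes column names) and quote = "\"" (so lines containing unmatched quote characters get merged), which is a plausible reason rows seemed to disappear. A quick sanity check, assuming the same file as above:

# the number of matching lines should equal the number of rows kept
length(grep("DF_API: input string", sec1))
nrow(new_sec1)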

Edit: A simple way to split the above column into multiple columns:

# extract the substring between the last < (or tab) and the closing >
new_sec1$V1 <- gsub(".*[<\t]([^>]+)[>].*", "\\1", new_sec1$V1)
# replace runs of commas with a single space
new_sec1$V1 <- gsub(",+", " ", new_sec1$V1)
# split on spaces, drop empty tokens, and bind the pieces into rows
new_sec1 <- strsplit(new_sec1$V1, " ")
new_sec1 <- lapply(new_sec1, function(x) x[x != ""])
new_sec1 <- do.call(rbind, new_sec1)
new_sec1 <- data.frame(new_sec1, stringsAsFactors = FALSE)
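
If the fields are whitespace-separated after the comma replacement, the last four steps can also be collapsed into a single read.table() call, which converts numeric columns for you as well (a sketch, assuming the same new_sec1 as above):

# split on runs of whitespace; fill = TRUE pads ragged rows with NA
new_sec1 <- read.table(text = new_sec1$V1, header = FALSE,
                       fill = TRUE, stringsAsFactors = FALSE)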

Change the column names to suit your analysis.
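For example (placeholder names, to be replaced with whatever the fields actually mean):

names(new_sec1) <- paste0("field", seq_along(new_sec1))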