
I am extracting data from multiple PDFs using a set of search words.

Table_search <- list("Table 14", "Listing [0-9]", "Program") 

Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)

This code loops through each PDF file, searches for the keywords, and extracts the matching lines. The number of matches differs between keywords, which produces the error below. This happens because some keywords are missing from specific PDFs. When a keyword is missing, the code should return NA for that PDF and then move on to the next one.

If NA is returned for the missing cells, my final output will have the same number of rows for each keyword I search for.

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 102, 98, 99

I searched for three keywords and got 102, 98, and 99 matches respectively. Instead, I should have 102 rows for each keyword I search for.

The expected number is 102 because I am looping through 102 PDF files.
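
To illustrate where the differing lengths come from, here is a minimal sketch with made-up data (the three strings in tablelist are hypothetical stand-ins for the real PDF text). grep() with value = TRUE keeps only the matching elements, so each keyword can return a different number of results, and building a data frame from them fails:

```r
# Hypothetical stand-in for the real data: one string per PDF.
tablelist <- c("Table 14.1 Listing 1 Program A",
               "Table 14.2 Program B",
               "Table 14.3 Listing 2")
Table_search <- list("Table 14", "Listing [0-9]", "Program")

# Same call as in the question.
Table_match_list <- sapply(Table_search, grep, x = tablelist, value = TRUE)

# Non-matching elements are dropped, so the lengths differ per keyword.
match_lengths <- lengths(Table_match_list)
print(match_lengths)

# data.frame() needs columns of equal length, so this raises the error.
err_msg <- tryCatch(
  {
    do.call(data.frame, Table_match_list)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)
```

With the real data, the same thing happens with 102, 98, and 99 matches.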

Please advise how we can achieve this.

Thank you, Bharath

@Ronak, updated: this is what I get from the 102 PDF files. The 3 sublists correspond to the 3 keywords. The first keyword is in all PDFs, the second is in 98 PDFs, and the third is in 99 PDFs.

[Image: current output, three sublists]

This is what I get from your code.

[Image: output from the suggested code]

What I need: it should not print NULL for every line of a PDF, just one NULL per PDF if the keyword is missing.

[Image: desired output]

[Image: contents of tablelist]

The error I showed was not triggered by the line of code I posted here. It occurs when I create a data frame from the match list. - Bharath
You can use tryCatch in the function you use for finding the line of the keyword. - Mohan Govindasamy
Can you include output of dput(tablelist) to your post so that I can use the data to verify the answer? - Ronak Shah
Are you sure my suggestion in the previous answer doesn't work? Here's a slightly modified version: Table_match_list <- sapply(Table_search, function(x) {tmp <- grep(x, tablelist, value = TRUE); if (length(tmp) > 0) toString(tmp) else NA}) What does this return? - Ronak Shah
Your previous suggestion actually works. The second image is from your code, where it prints NULL for every line even though all three keywords are available in the PDF. If you look at the third image (desired output), we just want NULL printed once if the keyword is not available in the entire page, instead of for every line. I think a slight modification of your previous answer should work, but I am unable to figure it out. - Bharath

1 Answer


You can try the following:

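Table_search <- c("Table 14", "Listing [0-9]", "Program")
Table_match_list <- sapply(Table_search, function(x) {
  # TRUE where an element of tablelist matches the keyword
  y <- grepl(x, tablelist)
  # Turn non-matches into NA so they are kept as gaps instead of being dropped
  y[!y] <- NA
  # Same length as tablelist: the matched element, or NA (NULL if tablelist is a list)
  tablelist[y]
})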
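
If the goal is a single entry per PDF rather than one per line, a possible variation is to run the search separately for each PDF and collapse the result. The sketch below assumes the extracted text is available as a list with one character vector of lines per PDF; pdf_lines and match_or_na are hypothetical names, and the sample lines are made up. It reuses the toString()/NA idea from the comments:

```r
# Hypothetical input: one character vector of extracted lines per PDF.
pdf_lines <- list(
  c("Table 14.1 Summary", "Listing 1 Details", "Program: a.sas"),
  c("Table 14.2 Summary", "Program: b.sas"),
  c("Table 14.3 Summary", "Listing 2 Details")
)
Table_search <- c("Table 14", "Listing [0-9]", "Program")

# All matching lines of one PDF joined into one string, or NA if none match.
match_or_na <- function(pattern, lines) {
  hits <- grep(pattern, lines, value = TRUE)
  if (length(hits) > 0) toString(hits) else NA_character_
}

# One column per keyword, one row per PDF, so every column has equal length.
Table_match_df <- as.data.frame(
  sapply(Table_search, function(pattern) {
    vapply(pdf_lines, match_or_na, character(1), pattern = pattern)
  }),
  stringsAsFactors = FALSE
)
print(Table_match_df)
```

Because every keyword yields exactly one value per PDF, the data frame has as many rows as there are PDFs (102 in the question), with NA wherever a keyword is absent.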