
I'm using text mining packages to read a group of PDF documents into plaintext, and I want to export this plaintext to a data frame, CSV, or text files (to facilitate further analysis with RTextTools).

First, I pulled the PDF documents into a VCorpus using the tm package. The tm package's VCorpus object stores lists containing a "PlainTextDocument" and "TextDocument" object for metadata and plaintext. For example, the metadata reads "Metadata: DocumentName1"... and the content reads "The terms of X are...".

    library(tm)

    docs <- VCorpus(DirSource(getwd()), readerControl = list(reader = readPDF))
    # Creates a large VCorpus containing ~700 PlainTextDocuments
    # (which contain strings/character vectors)

It was unclear to me how to process this into a data frame, so I hunted down a package with a utility function that converts it into a list of strings.

    library(textreg)
    strings <- convert.tm.to.character(docs)
    # Converts the VCorpus to a large list of strings with document content

From either the VCorpus or this list of strings, I'd like to create a data frame with just one row, in which each column contains one document's text and the column names correspond to the original filenames.

First I looked at this page, Export a list into a CSV or TXT file in R, and tried using sapply:

    df <- data.frame(text = sapply(docs, as.character), stringsAsFactors = FALSE)
    # Error during wrapup: arguments imply differing number of rows: 1, 5, 3, 3889, 3366
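
The error message suggests that each document comes back as a character vector of a different length (one element per line of text), so `sapply` cannot simplify the result and `data.frame` is handed columns of unequal length. A minimal base-R sketch of that situation, using a hypothetical list `doc_lines` (with made-up file names) as a stand-in for the corpus, and assuming that is indeed how the real documents are structured:

```r
# Hypothetical stand-in for the corpus: each element holds one document's lines.
doc_lines <- list(
  "a.pdf" = c("Table of Contents", "Page"),
  "b.pdf" = c("Contracting Parties", "", "5")
)

# The elements differ in length, so sapply() returns a list rather than a
# vector, and data.frame() fails on the unequal lengths.
raw <- sapply(doc_lines, as.character)
failed <- inherits(
  try(data.frame(text = raw, stringsAsFactors = FALSE), silent = TRUE),
  "try-error"
)

# Collapsing each document into a single string yields one value per document,
# which data.frame() accepts (one row per document, file names as row names).
collapsed <- vapply(doc_lines, paste, character(1), collapse = "\n")
df <- data.frame(text = collapsed, stringsAsFactors = FALSE)
```

The separator passed to `collapse` is a choice: `"\n"` keeps the line structure inside each cell, while `" "` flattens each document to a single line.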

I've also found related threads (R tm package vcorpus: Error in converting corpus to data frame), but found them difficult to apply, since they tend to use simpler Corpus objects.

Is there a simpler way to transform my list of strings or VCorpus into a data frame, say using dplyr/tidyr/purrr?

Any suggestions on improving my hacked-together solution would be much appreciated.

Edit: Sample of data

Each element of my list contains a string (a chr vector) holding the full text of one document. For example,

    strings[3]

yields this output

[16] "Table of Contents"
[17] "Page"
[18] ""
[19] "Contracting Parties"
[20] ""
[21] "5"
. . .

[379] "“Affiliate” means:"
[380] "(a)"
[381] ""
[382] "a company or any other entity in which any of the Parties holds, either directly or indirectly, the absolute"
[383] "majority of the votes in the shareholders’ meeting or is the holder of more than fifty percent (50%) of the rights"
[384] "and interests which confer the power of management on that company or entity, or has the power of"
[385] "management and control over such company or entity;"


1 Answer


This should do the trick:

    # Dummy data generation: file names and a list of strings (your corpus)
    files <- paste("file", 1:6)

    strings <- list("a", "b", "c", "d", "e", "f")
    names(strings) <- files
    t(as.data.frame(unlist(strings)))

    #                 file 1 file 2 file 3 file 4 file 5 file 6
    # unlist(strings) "a"    "b"    "c"    "d"    "e"    "f"

Edit, based on the data structure added to the question:

    files <- paste("file", 1:6)

    strings <- list(c("a", "b"), c("c", "d"), c("e", "f"),
                    c("g", "h"), c("i", "j"), c("k", "l"))

    names(strings) <- files
    t(data.frame(Doc = sapply(strings, paste0, collapse = " ")))

    #     file 1 file 2 file 3 file 4 file 5 file 6
    # Doc "a b"  "c d"  "e f"  "g h"  "i j"  "k l"
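
Note that `t()` returns a character matrix rather than a data frame. If an actual one-row data frame and files on disk are needed, as the question asks, the same idea can be extended roughly as follows. This is only a sketch on dummy data: the output directory (`tempdir()`) and the file name `documents.csv` are placeholders chosen for illustration, not anything from the original post.

```r
# Dummy data shaped like the question's list: one character vector per document.
strings <- list(c("a", "b"), c("c", "d"), c("e", "f"))
names(strings) <- paste("file", 1:3)

# One string per document.
collapsed <- vapply(strings, paste, character(1), collapse = " ")

# t() gives a 1 x n character matrix; as.data.frame() turns it into a real
# one-row data frame and keeps the column names unchanged.
wide <- as.data.frame(t(collapsed), stringsAsFactors = FALSE)

# Placeholder output location: a temporary directory.
out_dir <- tempdir()
csv_path <- file.path(out_dir, "documents.csv")
write.csv(wide, csv_path, row.names = FALSE)

# One plain-text file per document, named after the list element.
for (nm in names(strings)) {
  writeLines(strings[[nm]], file.path(out_dir, paste0(nm, ".txt")))
}
```

When reading the CSV back, `read.csv(csv_path, check.names = FALSE)` keeps column names such as "file 1" intact; by default they would be rewritten to syntactic names like "file.1".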