I'm using text mining packages to read a group of PDF documents into plaintext, and I want to export this plaintext to a dataframe/CSV/text files (to facilitate further analysis with RTextTools).
First, I pulled the PDF documents into a VCorpus using the tm package. As far as I can tell, each element of the VCorpus is a document of class "PlainTextDocument"/"TextDocument" that holds both metadata and plaintext, i.e. "Metadata: DocumentName1"... and the content, "The terms of X are...".
library(tm)
docs <- VCorpus(DirSource(getwd()), readerControl = list(reader = readPDF))
# Creates large VCorpus containing ~700 PlainTextDocuments
# (which contain strings/character vectors)
It was unclear to me how to process this into a dataframe, so I hunted down a package with a utility function that converts it into a list of strings.
library(textreg)
strings <- convert.tm.to.character(docs)
# Converts VCorpus to large list of strings with document content
From either the VCorpus or this list of strings, I'd like to create a data frame with just one row, in which each column contains one document's text and is named after that document's original filename.
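To make the target shape concrete, here is a minimal base R sketch on toy data. The list `toy_strings` and its file names are made up for illustration and only stand in for the real list of strings; collapsing each document's lines with a newline is one possible choice, not necessarily the right one for every analysis.

```r
# Hypothetical stand-in for the real list: one character vector of lines per document.
toy_strings <- list(
  "doc1.pdf" = c("Table of Contents", "Page", ""),
  "doc2.pdf" = c("Contracting Parties", "", "5", "Affiliate means:")
)

# Collapse each document's lines into a single string, so every document has length 1.
collapsed <- vapply(toy_strings, paste, character(1), collapse = "\n")

# One row, one column per document; check.names = FALSE leaves the file names untouched.
wide <- data.frame(as.list(collapsed), check.names = FALSE, stringsAsFactors = FALSE)

dim(wide)    # number of rows and columns
names(wide)  # column names taken from the list names
```

If one row per document is easier to work with downstream, `t(wide)` or a two-column layout (file name, text) would carry the same information.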
First I looked at this page, Export a list into a CSV or TXT file in R, and tried using sapply:
df <- data.frame(text = sapply(docs, as.character), stringsAsFactors = FALSE)
^Error during wrapup: arguments imply differing number of rows: 1, 5, 3, 3889, 3366
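My guess is that the error comes from each document being a character vector of a different length (1, 5, 3, 3889, 3366 lines). The base R sketch below reproduces that on toy data and then collapses each document to a single string first. `toy_docs` and its file names are hypothetical; on the real corpus the analogous step would presumably be something like `sapply(docs, function(d) paste(as.character(d), collapse = "\n"))`, which I have not verified.

```r
# Hypothetical documents with different numbers of lines.
toy_docs <- list(
  "a.pdf" = c("line 1"),
  "b.pdf" = c("line 1", "line 2", "line 3")
)

# The lengths differ, so sapply() cannot simplify and returns a list.
lines_per_doc <- sapply(toy_docs, as.character)
is.list(lines_per_doc)

# data.frame() then receives columns of unequal length and signals an error;
# tryCatch() captures the message instead of stopping the script.
result <- tryCatch(
  data.frame(text = lines_per_doc, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(result)

# Collapsing each document to one string gives every document length 1.
one_string_per_doc <- vapply(toy_docs, paste, character(1), collapse = "\n")

# Long layout: one row per document, with the file name kept in its own column.
long <- data.frame(
  doc = names(one_string_per_doc),
  text = unname(one_string_per_doc),
  stringsAsFactors = FALSE
)
print(long)
```

A data frame in this long layout could then be passed to `write.csv()` for export.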
I've also found related threads (R tm package vcorpus: Error in converting corpus to data frame), but found them difficult to apply here, since they tend to use simpler Corpus objects.
Is there a simpler way I can transform my list of strings or VCorpus to a dataframe, say using dplyr/tidyr/purrr?
Any suggestions on improving my hacked-together solution much appreciated.
Edit: Sample of data
Each element of my list contains a string(/chr vector) with a full document in text. For example,
strings[3]
yields this output
[16] "Table of Contents"
[17] "Page"
[18] ""
[19] "Contracting Parties"
[20] ""
[21] "5"
.
.
.
[379] "“Affiliate” means:"
[380] "(a)"
[381] ""
[382] "a company or any other entity in which any of the Parties holds, either directly or indirectly, the absolute"
[383] "majority of the votes in the shareholders’ meeting or is the holder of more than fifty percent (50%) of the rights"
[384] "and interests which confer the power of management on that company or entity, or has the power of"
[385] "management and control over such company or entity;"