I am trying to create a corpus from Java source code.
I am following the preprocessing steps described in this paper: http://cs.queensu.ca/~sthomas/data/Thomas_2011_MSR.pdf
According to Section 2.1, the following preprocessing steps should be applied:
- remove characters related to the syntax of the programming language [already done by removePunctuation]
- remove programming language keywords [already done by tm_map(dsc, removeWords, javaKeywords)]
- remove common English-language stopwords [already done by tm_map(dsc, removeWords, stopwords("english"))]
- stem the remaining words [already done by tm_map(dsc, stemDocument)]
The remaining step is to split identifiers and method names into multiple parts based on common naming conventions.
For example, 'firstName' should be split into 'first' and 'name'.
Similarly, 'calculateAge' should be split into 'calculate' and 'age'.
Can anybody help me with this? Here is my code so far:
library(tm)

# Read all .java files below the working directory (pattern is a regular expression)
dd <- DirSource(pattern = "\\.java$", recursive = TRUE)

# Java keywords and literals to drop from the corpus
javaKeywords <- c("abstract","continue","for","new","switch","assert","default","package","synchronized","boolean","do","if","private","this","break","double","implements","protected","throw","byte","else","null","NULL","TRUE","FALSE","true","false","import","public","throws","case","enum","instanceof","return","transient","catch","extends","int","short","try","char","final","interface","static","void","class","finally","long","volatile","const","float","native","super","while")

dsc <- Corpus(dd)
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, stopwords("english"))
dsc <- tm_map(dsc, removeWords, javaKeywords)
dsc <- tm_map(dsc, stemDocument)

# Term-frequency document-term matrix; stopwords were already removed above
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE))
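
One possible way to do the splitting step is sketched below. It uses only base R regular expressions and assumes camelCase and snake_case conventions. `split_identifiers` is a hypothetical helper name (it is not part of tm), and lowercasing the result is an assumption made so that the later removeWords calls match. It would probably have to run before removePunctuation, since that step deletes the underscores the split relies on.

```r
# Hypothetical helper: split identifiers on common naming conventions.
# Works on a character vector and returns a character vector of the same length.
split_identifiers <- function(x) {
  # snake_case: treat underscores as word separators
  x <- gsub("_", " ", x, fixed = TRUE)
  # lower case or digit followed by upper case: firstName -> first Name
  x <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", x, perl = TRUE)
  # acronym followed by a capitalised word: XMLParser -> XML Parser
  x <- gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", x, perl = TRUE)
  # assumption: lowercase everything so later word removal matches
  tolower(x)
}

split_identifiers("firstName")     # "first name"
split_identifiers("calculateAge")  # "calculate age"
split_identifiers("XMLParser")     # "xml parser"

# Possible use in the pipeline above, before removePunctuation:
# dsc <- tm_map(dsc, content_transformer(split_identifiers))
```

content_transformer is the tm wrapper for applying an ordinary string function to each document. Identifiers that follow other conventions (for example all-lowercase concatenations) are left unsplit by this sketch.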