Split Identifier and Method Names in Creating Source Code Corpus

Question

I am trying to create a corpus from Java source code.
I am following the preprocessing steps in this paper http://cs.queensu.ca/~sthomas/data/Thomas_2011_MSR.pdf

According to Section 2.1 of the paper, the following preprocessing steps should be applied:
- remove characters related to the syntax of the programming language [already done by removePunctuation]
- remove programming language keywords [already done by tm_map(dsc, removeWords, javaKeywords)]
- remove common English-language stopwords [already done by tm_map(dsc, removeWords, stopwords("english"))]
- stem the words [already done by tm_map(dsc, stemDocument)]

The remaining part is to split identifier and method names into multiple parts based on common naming conventions.

For example 'firstName' should be split into 'first' and 'name'.

Another example 'calculateAge' should be split into 'calculate' and 'age'.
Can anybody help me with this?

    library(tm)
    # read every .java file below the working directory
    dd <- DirSource(pattern = "\\.java$", recursive = TRUE)
    javaKeywords <- c("abstract","continue","for","new","switch","assert","default","package","synchronized","boolean","do","if","private","this","break","double","implements","protected","throw","byte","else","null","NULL","TRUE","FALSE","true","false","import","public","throws","case","enum","instanceof","return","transient","catch","extends","int","short","try","char","final","interface","static","void","class","finally","long","volatile","const","float","native","super","while")
    dsc <- Corpus(dd)
    dsc <- tm_map(dsc, stripWhitespace)
    dsc <- tm_map(dsc, removePunctuation)
    dsc <- tm_map(dsc, removeNumbers)
    dsc <- tm_map(dsc, removeWords, stopwords("english"))
    dsc <- tm_map(dsc, removeWords, javaKeywords)
    dsc <- tm_map(dsc, stemDocument)  # needs the SnowballC package
    dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE))

agstudy · Accepted Answer · 2014-09-20T23:53:19

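You can write a custom (vectorized) function that splits words at capital letters:

    # insert a space at every lower-case/upper-case boundary,
    # then lower-case the result and split on the spaces
    splitCapital <- function(x)
      unlist(strsplit(tolower(gsub('([a-z])([A-Z])', '\\1 \\2', x)), ' '))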

Example:

    splitCapital('firstName')
    [1] "first" "name"

    splitCapital(c('firstName','calculateAge'))
    [1] "first"     "name"      "calculate" "age"
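
The two examples above are simple two-part camelCase names. Real Java code also contains identifiers with several capitals, acronyms, or underscores (for instance 'getFirstName', 'XMLHttpRequest', 'MAX_VALUE'). A minimal sketch of a more general splitter could look like the following. Note that 'splitIdentifier' is a hypothetical name, not part of tm or of the paper, and the acronym rule is only one possible convention.

```r
# Hypothetical helper (an assumption for illustration, not from tm):
# splits camelCase, PascalCase and snake_case identifiers into lower-case parts.
splitIdentifier <- function(x) {
  # boundary between a lower-case letter or digit and an upper-case letter
  x <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", x, perl = TRUE)
  # boundary between an acronym and a following capitalised word
  x <- gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", x, perl = TRUE)
  # underscores act as separators
  x <- gsub("_+", " ", x, perl = TRUE)
  parts <- unlist(strsplit(tolower(x), " +"))
  parts[nzchar(parts)]  # drop empty strings, e.g. from a leading underscore
}

splitIdentifier("getFirstName")    # "get" "first" "name"
splitIdentifier("XMLHttpRequest")  # "xml" "http" "request"
splitIdentifier("MAX_VALUE")       # "max" "value"
```

perl = TRUE is used so that the ranges [a-z] and [A-Z] do not depend on the locale.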

Then you can iterate over your corpus:

    corpus.split <- lapply(dsc,splitCapital)
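
Because of the unlist() call, the result is a flat vector of words rather than one text per document. If the split text should stay inside the corpus for the later tm steps, one option is a transformation that returns one string per input string. This is only a sketch under assumptions: 'splitInText' is a hypothetical helper, and the commented tm_map() line assumes tm 0.6 or later, where content_transformer() is available.

```r
# Hypothetical helper (an assumption for illustration): rewrites identifiers
# inside each string as space-separated lower-case words, keeping one output
# string per input string.
splitInText <- function(text) {
  text <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", text, perl = TRUE)
  text <- gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", text, perl = TRUE)
  tolower(gsub("_+", " ", text, perl = TRUE))
}

docs <- c("int calculateAge(Date birthDate)", "String firstName;")
splitInText(docs)
# "int calculate age(date birth date)" "string first name;"

# With a tm corpus (assumes tm >= 0.6):
# dsc <- tm_map(dsc, content_transformer(splitInText))
```

The split has to run before removePunctuation (which deletes underscores and would glue snake_case parts together) and before any step that lower-cases the text, since the capital letters are what mark the word boundaries.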
