1
votes

I am trying to create a corpus from Java source code.
I am following the preprocessing steps in this paper http://cs.queensu.ca/~sthomas/data/Thomas_2011_MSR.pdf

According to Section 2.1 of the paper, the following preprocessing steps should be applied:
- remove characters related to the syntax of the programming language [already done by removePunctuation]
- remove programming language keywords [already done by tm_map(dsc, removeWords, javaKeywords)]
- remove common English-language stopwords [already done by tm_map(dsc, removeWords, stopwords("english"))]
- stem each word [already done by tm_map(dsc, stemDocument)]

The remaining part is to split identifier and method names into multiple parts based on common naming conventions.

For example 'firstName' should be split into 'first' and 'name'.

Another example 'calculateAge' should be split into 'calculate' and 'age'.
Can anybody help me with this?

    library(tm)

    # The pattern is a regular expression: match only file names ending in ".java"
    dd <- DirSource(pattern = "\\.java$", recursive = TRUE)
    javaKeywords <- c("abstract", "continue", "for", "new", "switch", "assert", "the", "default",
                      "package", "synchronized", "boolean", "do", "if", "private", "this", "break",
                      "double", "implements", "protected", "throw", "byte", "else", "null", "NULL",
                      "TRUE", "FALSE", "true", "false", "import", "public", "throws", "case", "enum",
                      "instanceof", "return", "transient", "catch", "extends", "int", "short", "try",
                      "char", "final", "interface", "static", "void", "class", "finally", "long",
                      "volatile", "const", "float", "native", "super", "while")
    dsc <- Corpus(dd)
    dsc <- tm_map(dsc, stripWhitespace)
    dsc <- tm_map(dsc, removePunctuation)
    dsc <- tm_map(dsc, removeNumbers)
    dsc <- tm_map(dsc, removeWords, stopwords("english"))
    dsc <- tm_map(dsc, removeWords, javaKeywords)
    dsc <- tm_map(dsc, stemDocument)
    dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE))

3 Answers

1
votes

You can write a custom (vectorized) function that splits words at capital letters:

# Insert a space at every lower-case-to-capital boundary, then lower-case and split
splitCapital  <- function(x) 
     unlist(strsplit(tolower(gsub('([a-z0-9])([A-Z])','\\1 \\2',x)),' '))

Example:

splitCapital('firstName')
[1] "first" "name" 

splitCapital(c('firstName','calculateAge'))
[1] "first"     "name"      "calculate" "age"  

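Then you can apply the split to your corpus. Run it before stemming and before anything that lower-cases the text, since it relies on the capital letters. With recent versions of tm, wrap the function in content_transformer and keep each document as a string instead of splitting it into a vector:

dsc <- tm_map(dsc, content_transformer(function(x) gsub('([a-z0-9])([A-Z])', '\\1 \\2', x)))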
1
votes

I've written a tool in Perl to do all kinds of source code preprocessing, including identifier splitting:

https://github.com/stepthom/lscp

The relevant piece of code there is:

=head2 tokenize
 Title    : tokenize
 Usage    : tokenize($wordsIn)
 Function : Splits words based on camelCase, under_scores, and dot.notation.
          : Leaves other words alone.
 Returns  : $wordsOut => string, the tokenized words
 Args     : named arguments:
          : $wordsIn => string, the white-space delimited words to process
=cut
sub tokenize{
    my $wordsIn  = shift;
    my $wordsOut = "";

    for my $w (split /\s+/, $wordsIn) {
        # Split up camel case: aaA ==> aa A
        $w =~ s/([a-z]+)([A-Z])/$1 $2/g;

        # Split up camel case: AAa ==> A Aa
        # Split up camel case: AAAAa ==> AAA Aa
        $w =~ s/([A-Z]{1,100})([A-Z])([a-z]+)/$1 $2$3/g;

        # Split up underscores 
        $w =~ s/_/ /g;

        # Split up dots
        $w =~ s/([a-zA-Z0-9])\.+([a-zA-Z0-9])/$1 $2/g;

        $wordsOut = "$wordsOut $w";
    }

    return removeDuplicateSpaces($wordsOut);
}

The above hacks are based on my own experience with preprocessing source code for textual analysis. Feel free to steal and modify.
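
For readers who don't use Perl, here is a rough Python port of the four substitutions above. It is a sketch, not part of lscp: `removeDuplicateSpaces` is not shown in the excerpt, so the port assumes it simply collapses runs of whitespace, and the sample identifiers are made up for illustration.

```python
import re


def tokenize(words_in):
    """Split whitespace-delimited words on camelCase, under_scores and dot.notation."""
    words_out = []
    for w in words_in.split():
        # Split up camel case: aaA ==> aa A
        w = re.sub(r"([a-z]+)([A-Z])", r"\1 \2", w)
        # Split up camel case: AAa ==> A Aa, AAAAa ==> AAA Aa
        w = re.sub(r"([A-Z]{1,100})([A-Z])([a-z]+)", r"\1 \2\3", w)
        # Split up underscores
        w = w.replace("_", " ")
        # Split up dots that sit between two alphanumeric characters
        w = re.sub(r"([a-zA-Z0-9])\.+([a-zA-Z0-9])", r"\1 \2", w)
        words_out.append(w)
    # Assumed stand-in for removeDuplicateSpaces: collapse whitespace runs
    return " ".join(" ".join(words_out).split())


print(tokenize("firstName calculateAge"))    # first Name calculate Age
print(tokenize("XMLHttpRequest"))            # XML Http Request
print(tokenize("java.util.List user_name"))  # java util List user name
```

Like the Perl original, this keeps the case of each piece; lower-casing is left to a later step.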

0
votes

I realize this is an old question and the OP has either solved their problem or moved on, but in case someone else comes across this question and is seeking an identifier splitting package, I would like to offer Spiral ("SPlitters for IdentifieRs: A Library"). It is written in Python but comes with a command-line utility that can read a file of identifiers (one per line) and split each one.

Splitting identifiers is deceptively difficult. It's actually a research-grade problem for which no perfect solution exists today. Even in cases where the input consists of identifiers that follow some convention such as camel case, ambiguities can arise—and of course, things are much harder when source code does not follow a consistent convention.
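
To make the ambiguity concrete, here is a minimal sketch of a purely convention-based camel-case splitter in Python. The function name `camel_split` and its two regular expressions are illustrative assumptions, not Spiral's code; the identifiers are ones that appear elsewhere in this answer.

```python
import re


def camel_split(identifier):
    """Naive camel-case splitter using two boundary rules and nothing else."""
    # Rule 1: lower-case letter or digit followed by a capital (fooBar -> foo Bar)
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    # Rule 2: run of capitals followed by a capitalised word (XMLParser -> XML Parser)
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", s)
    return s.split()


print(camel_split("firstName"))                # ['first', 'Name']
print(camel_split("J2SEProjectTypeProfiler"))  # ['J2', 'SE', 'Project', 'Type', 'Profiler']
print(camel_split("GPSmodule"))                # ['GP', 'Smodule']
print(camel_split("savefileas"))               # ['savefileas']
print(camel_split("nbrOfbugs"))                # ['nbr', 'Ofbugs']
```

The rules work on well-behaved names such as firstName, but they cannot know that J2SE is a single unit, that GPS ends before a lower-case word, or where the word boundaries in savefileas fall. Tuning the regular expressions to fix one of these cases tends to break another.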

Spiral implements numerous identifier splitting algorithms, including a novel algorithm called Ronin. It uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin can split identifiers that do not use camel case or other naming conventions, including cases such as splitting J2SEProjectTypeProfiler into [J2SE, Project, Type, Profiler], which requires the reader to recognize J2SE as a unit. Here are some more examples of what Ronin can split:

# spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
mStartCData: ['m', 'Start', 'C', 'Data']
nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
getUtf8Octets: ['get', 'Utf8', 'Octets']
GPSmodule: ['GPS', 'module']
savefileas: ['save', 'file', 'as']
nbrOfbugs: ['nbr', 'Of', 'bugs']

If you want simple strict camel-case or other simpler splitters, Spiral offers several of those too. Please see the GitHub page for more information.