Supervised Learning on Coding Style - Feature Selection (Scikit Learn)

Question

I am researching whether or not it is possible to automate the scoring of student's code based off of coding style. This includes things like avoiding duplicate code, commented out code, bad naming of variables and more.

We are trying to learn based off of past semester's composition scores (ranging from 1-3), which leads nicely to supervised learning. The basic idea is that we extract features from a student's submissions, and make a feature_vector, then run it through logistic regression using scikit-learn. We also have tried various things including running PCA on the feature vectors to reduce dimensionality.

Our classifier is simply guessing the most frequent class, which is a score of 2. I believe that it's because our features are simply NOT predictive in any way. Is there any other possible reason for a supervised learning algorithm to only guess the dominant class? Is there any way to prevent this?

As I believe it's due to the features not being predictive, is there a way to determine what a "good" feature would be? (And by good, I mean discriminable or predictive).

Note: As a side experiment, we tested how consistent the past grades were by having readers grade assignments that had already been graded. Only 55% of them gave the same composition score (1-3) for the projects. This might mean this dataset is simply not classifiable because humans can't even grade consistently. Any tips on other ideas? Or whether or not that is in fact the case?

Features include: Number of lines of duplicate code, average function length, number of 1 character variables, number of lines that include commented out code, maximum line length, count of unused imports, unused variables, unused parameters. A few more... We visualized all of our features and found that while the average is correlated with the score, the variation is really large (not promising).

Edit: Scope of our project: we are only trying to learn from one particular project (with skeleton code given) in one class. We don't need to generalize as of yet.

However the answer will be more driven by Statistics here and not Computer Sciences. — Yavar
Please clarify your note at the end of the Q: "tested how consistent the past grades were and determined that they weren't at all". Consistent according to what? How did you test? — dan3
This sounds more like linear regression (numeric prediction) rather than logistic one (classification task). With linear regression you will get numbers like 1.2, 1.8, 1.5, ... instead of simply label "2", which may give you some insights. Also note that linear model (in both - linear and logistic regression) may just be a bad way to represent relations between variables. So you can also try other approaches like splitting data with hyperplanes (SVM, possibly with non-linear kernels) or counting probabilities (e.g. Naive Bayes). BTW, what features do you use (some examples would be helpful). — ffriend

hrs hrs · Accepted Answer · 2013-11-19T12:07:15

Features include: Number of lines of duplicate code, average function length, number of 1 character variables, number of lines that include commented out code, maximum line length, count of unused imports, unused variables, unused parameters. A few more..

Have you tried normalizing the features? It seems that you want to train a neural network which is able to classify any given code into a category. Now because different codes will have say, different number of lines of duplicate code and different number of unused variables but may be equally bad. For this reason, you need to normalize your parameters by say, total lines of 'useful' code.

Failing to find good features is very daunting. When stagnant, always follow your intuition. If a human can do a task, so can a computer. Since your features look quite modest for assessing any given code, they ought to work (given that they are used properly).

Summary: Normalization of features should solve the problem.

Supervised Learning on Coding Style - Feature Selection (Scikit Learn)

3 Answers