I am researching whether it is possible to automate the scoring of students' code based on coding style. This includes things like avoiding duplicate code, commented-out code, poorly named variables, and more.
We are trying to learn from past semesters' composition scores (ranging from 1 to 3), which lends itself nicely to supervised learning. The basic idea is to extract features from a student's submissions, build a feature vector, and run it through logistic regression using scikit-learn. We have also tried various things, including running PCA on the feature vectors to reduce dimensionality.
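For context, the setup looks roughly like this (a minimal sketch, not our actual code; `X` stands for the matrix of per-submission feature vectors, `y` for the 1-3 composition scores, and the data here is a random placeholder):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: X is (n_submissions, n_features), y holds scores in {1, 2, 3}
X = np.random.rand(200, 10)
y = np.random.randint(1, 4, size=200)

# Scale the features, optionally reduce dimensionality with PCA,
# then fit a logistic regression classifier
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```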
Our classifier simply guesses the most frequent class, which is a score of 2. I believe this is because our features are simply NOT predictive in any way. Is there any other possible reason for a supervised learning algorithm to only guess the dominant class? Is there any way to prevent this?
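For what it's worth, this is how we can confirm the symptom: compare against a majority-class baseline and see whether class weighting changes anything (again just a sketch with placeholder data, not a claim that class weights are the fix):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the real feature vectors and 1-3 scores
X = np.random.rand(200, 10)
y = np.random.randint(1, 4, size=200)

# Majority-class baseline: a useful model has to beat this
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)

# Logistic regression with balanced class weights, so the minority scores
# (1 and 3) are not drowned out by the dominant score of 2
weighted = cross_val_score(
    LogisticRegression(class_weight="balanced", max_iter=1000), X, y, cv=5
)

print("majority baseline:            %.2f" % baseline.mean())
print("balanced logistic regression: %.2f" % weighted.mean())
```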
Since I believe the problem is that the features are not predictive, is there a way to determine what a "good" feature would be? (By good, I mean discriminative or predictive.)
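To make that concrete, this is the kind of per-feature check I have in mind, using scikit-learn's univariate scores (a sketch with made-up feature names and data):

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

# Placeholder feature matrix and 1-3 scores standing in for the real data
X = np.random.rand(200, 8)
y = np.random.randint(1, 4, size=200)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# ANOVA F-test: does the feature's mean differ across the three score classes?
f_scores, p_values = f_classif(X, y)

# Mutual information: a nonparametric measure of dependence between feature and score
mi = mutual_info_classif(X, y, random_state=0)

for name, f, p, m in sorted(zip(feature_names, f_scores, p_values, mi),
                            key=lambda t: -t[3]):
    print(f"{name:12s}  F={f:6.2f}  p={p:.3f}  MI={m:.3f}")
```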
Note: As a side experiment, we tested how consistent the past grades were by having readers re-grade assignments that had already been graded. Only 55% of them gave the same composition score (1-3) as the original grade. This might mean the dataset is simply not classifiable because even humans cannot grade consistently. Any tips on other approaches, or on whether that is in fact the case?
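If it helps to quantify that, the raw percent agreement can be supplemented with a chance-corrected measure such as Cohen's kappa (a sketch; the two grade lists below are invented, not our real data):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: original scores vs. the re-graders' scores for ten projects
original = [2, 2, 1, 3, 2, 2, 1, 2, 3, 2]
regrade  = [2, 1, 1, 3, 2, 3, 2, 2, 3, 2]

# Percent agreement (what the 55% figure measures) vs. chance-corrected kappa
agreement = sum(a == b for a, b in zip(original, regrade)) / len(original)
kappa = cohen_kappa_score(original, regrade)

print(f"raw agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```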
Features include: number of lines of duplicate code, average function length, number of one-character variable names, number of lines containing commented-out code, maximum line length, and counts of unused imports, unused variables, and unused parameters, plus a few more. We visualized all of our features and found that while the per-class averages are correlated with the score, the variation within each class is very large (not promising); a sketch of that kind of plot is below.
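The visualization was along these lines: one feature at a time, grouped by composition score (a matplotlib sketch with placeholder data, where the feature name and numbers are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: one style feature (e.g. average function length) per submission
scores = np.random.randint(1, 4, size=200)
feature = np.random.rand(200) * 40 + 5 * scores  # means shift with the score

# Boxplot of the feature grouped by score: the medians shift with the score,
# but the boxes overlap heavily, which is the "large variation" problem
plt.boxplot([feature[scores == s] for s in (1, 2, 3)])
plt.xticks([1, 2, 3], ["1", "2", "3"])
plt.xlabel("composition score")
plt.ylabel("average function length (placeholder feature)")
plt.show()
```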
Edit: To clarify the scope of our project: we are only trying to learn from one particular project (with skeleton code given) in one class. We do not need to generalize yet.