8 votes

I am researching whether it is possible to automate the scoring of students' code based on coding style. This includes things like avoiding duplicate code, commented-out code, bad variable naming, and more.

We are trying to learn from past semesters' composition scores (ranging from 1 to 3), which lends itself nicely to supervised learning. The basic idea is that we extract features from a student's submission, build a feature vector, and run it through logistic regression using scikit-learn. We have also tried various things, including running PCA on the feature vectors to reduce dimensionality.
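Roughly, the pipeline looks like this (a minimal sketch with stand-in data; X and y are placeholders for our real feature vectors and scores):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.poisson(3, size=(100, 8)).astype(float)         # stand-in feature vectors
    y = rng.choice([1, 2, 3], size=100, p=[0.2, 0.6, 0.2])  # stand-in scores, 2 dominant

    # Scale before PCA since the raw counts live on very different scales
    clf = make_pipeline(StandardScaler(), PCA(n_components=5),
                        LogisticRegression(max_iter=1000))
    clf.fit(X, y)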

Our classifier simply guesses the most frequent class, which is a score of 2. I believe that's because our features are simply NOT predictive in any way. Is there any other possible reason for a supervised learning algorithm to only guess the dominant class? Is there any way to prevent this?
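One way we check this behavior is to compare against an explicit majority-class baseline (a sketch reusing X, y, and clf from above); if the cross-validated accuracy doesn't beat the dummy, the model is learning nothing from the features:

    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import cross_val_score

    # Baseline that always predicts the most frequent class
    baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                               X, y, cv=5).mean()
    actual = cross_val_score(clf, X, y, cv=5).mean()
    print(f"majority baseline: {baseline:.3f}, our model: {actual:.3f}")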

Since I believe it's due to the features not being predictive, is there a way to determine what a "good" feature would be? (And by good, I mean discriminative or predictive.)

Note: As a side experiment, we tested how consistent the past grades were by having readers re-grade assignments that had already been graded. Only 55% of them gave the same composition score (1-3) for those projects. This might mean the dataset is simply not classifiable, because humans can't even grade consistently. Any tips on other ideas? Or is that in fact the case?
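Since one score dominates, 55% raw agreement might not be far above chance. One way to quantify this is a chance-corrected agreement statistic such as Cohen's kappa (a minimal sketch with made-up grade arrays; values near 0 would suggest the labels are too noisy to learn from):

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical example: grades assigned to the same projects by two readers
    original_grades = [2, 2, 1, 3, 2, 2, 3, 1, 2, 2]
    regrade         = [2, 1, 1, 2, 2, 3, 3, 2, 2, 2]
    print(cohen_kappa_score(original_grades, regrade))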

Features include: number of lines of duplicate code, average function length, number of one-character variables, number of lines that include commented-out code, maximum line length, and counts of unused imports, unused variables, and unused parameters, plus a few more. We visualized all of our features and found that while the averages correlate with the score, the variance is very large (not promising).
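For illustration, extracting two of these features might look like this (a minimal sketch, assuming the submissions are Python; it counts occurrences of one-character names rather than distinct variables):

    import ast

    def style_features(source: str):
        tree = ast.parse(source)
        # Longest line in the submission
        max_line_len = max((len(line) for line in source.splitlines()), default=0)
        # Occurrences of one-character identifiers
        one_char_vars = sum(1 for node in ast.walk(tree)
                            if isinstance(node, ast.Name) and len(node.id) == 1)
        return [max_line_len, one_char_vars]

    print(style_features("x = 1\ntotal = x + 41\n"))  # -> [14, 2]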

Edit: Scope of our project: we are only trying to learn from one particular project (with skeleton code given) in one class. We don't need to generalize yet.

+1. Wow, what a question! – Yavar
However, the answer will be driven more by statistics here than by computer science. – Yavar
Included "statistics" as a tag. Thanks! – stogers
Please clarify the note at the end of the question: "tested how consistent the past grades were and determined that they weren't at all". Consistent according to what? How did you test? – dan3
This sounds more like linear regression (numeric prediction) than logistic regression (a classification task). With linear regression you will get numbers like 1.2, 1.8, 1.5, ... instead of just the label "2", which may give you some insight. Also note that a linear model (in both linear and logistic regression) may simply be a bad way to represent the relations between variables, so you could also try other approaches, such as splitting the data with hyperplanes (SVM, possibly with non-linear kernels) or estimating probabilities (e.g., Naive Bayes). BTW, what features do you use? Some examples would be helpful. – ffriend

3 Answers

1 vote

Features include: number of lines of duplicate code, average function length, number of one-character variables, number of lines that include commented-out code, maximum line length, and counts of unused imports, unused variables, and unused parameters, plus a few more.

Have you tried normalizing the features? It seems that you want to train a classifier that assigns any given piece of code to a category. Two submissions of different lengths will have, say, different numbers of lines of duplicate code and different numbers of unused variables, yet may be equally bad. For this reason, you need to normalize your features by, say, the total lines of 'useful' code.
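For example (a minimal sketch; the arrays are hypothetical stand-ins for your raw counts and each submission's size):

    import numpy as np

    X = np.array([[12, 4, 3],    # e.g. duplicate lines, unused vars, unused imports
                  [ 2, 1, 0]])
    loc = np.array([400, 40])    # total lines of 'useful' code per submission

    # Counts per line of useful code: the short, sloppy submission now
    # stands out instead of being dwarfed by the long one
    X_normalized = X / loc[:, None]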

Failing to find good features is daunting. When you are stuck, follow your intuition: if a human can do a task, so can a computer. Since your features look quite reasonable for assessing code, they ought to work (provided they are used properly).

Summary: Normalization of features should solve the problem.

1
votes

Just a thought - Andrew Ng teaches a Machine Learning course on Coursera (https://www.coursera.org/course/ml). There are several programming assignments that students submit throughout the class. I remember reading (though unfortunately I can't find the article now) that there was ongoing research attempting to cluster student-submitted programming assignments from the class, with the intuition that there are common mistakes students make on the assignments.

Not sure if this helps you, but perhaps treating this as an unsupervised learning problem would make more sense (e.g., looking for similarities between code samples, with the intuition that similar code samples should receive similar scores).
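A minimal sketch of that idea, with a stand-in feature matrix in place of your real one:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.poisson(3, size=(100, 8)).astype(float)  # stand-in style features

    # Scale first so no single count dominates the distance metric
    X_scaled = StandardScaler().fit_transform(X)
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
    # Then check whether submissions in the same cluster tend to share a grade

You could then compare the cluster assignments against the historical scores to see whether the features carry any grade-related signal at all.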

0 votes
  1. You want to balance your target classes (a close-to-equal number of 1, 2, and 3 scores). You can randomly subsample the over-sized classes, bootstrap-sample the under-sized classes, or use an algorithm that accounts for unbalanced data (I'm not sure which Python implementations do; see the sketch after this list for one option).

  2. Make sure you are cross-validating to prevent over-fitting.

  3. There are a few ways to figure out which attributes are important:

    • try combinations of attributes, starting from a single one and adding more (forward selection)
    • or try combinations of attributes, starting from the full set and removing them (backward elimination)
    • or try attribute combinations at random (or with a genetic algorithm)

Choose the attribute combination with the highest cross-validated accuracy. The sketch below illustrates all three points.
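A minimal sketch with scikit-learn, using stand-in X and y (hypothetical): imbalance is handled via class_weight='balanced' (one algorithm option that accounts for unbalanced data), accuracy is cross-validated, and features are added greedily, as in the first bullet:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.poisson(3, size=(100, 8)).astype(float)         # stand-in features
    y = rng.choice([1, 2, 3], size=100, p=[0.2, 0.6, 0.2])  # imbalanced scores

    model = LogisticRegression(class_weight="balanced", max_iter=1000)

    # Greedy forward selection: repeatedly add the feature that most
    # improves cross-validated accuracy, stopping when nothing helps
    selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
    while remaining:
        scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f, s = max(scores.items(), key=lambda kv: kv[1])
        if s <= best_score:
            break
        selected.append(f)
        remaining.remove(f)
        best_score = s

    print("selected features:", selected, "cv accuracy:", round(best_score, 3))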

You can also take products of the attribute columns (interaction terms) to see whether pairs of features have an effect together.
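For example, with scikit-learn (a minimal sketch, reusing the X above):

    from sklearn.preprocessing import PolynomialFeatures

    # Appends all pairwise column products to the original features
    interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                      include_bias=False)
    X_with_products = interactions.fit_transform(X)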