I'm trying to get Weka to predict from the command line, but I'm concerned I might be doing this wrong. I read the Data Mining book and searched their site for documentation, yet what I found was vague at best, so I hope you can help me.

First, I created a training set (train.arff). Here's a sample:

@relation test
@attribute 'A' {0,1}
@attribute 'B' {0,1}
@attribute 'C' {0,1}
@attribute 'D' {0,1}
@attribute 'E' {0,1}
@attribute 'F' {0,1}
@data
0,0,0,0,0,0
0,0,0,0,0,0
...

Then I created a data set to be completed by prediction (test.arff):

@relation test
@attribute 'A' {0,1}
@attribute 'B' {0,1}
@attribute 'C' {0,1}
@attribute 'D' {0,1}
@attribute 'E' {0,1}
@attribute 'F' {0,1}
@data
0,?,0,0,0,0
0,?,0,0,0,0
...

The "?" marks the attribute that should be predicted.

Finally, I attempted to get the predictions by running this on the command line:

java weka.classifiers.trees.J48 -t train.arff -T test.arff -p 0

It produces the following output:

=== Predictions on test data ===

 inst#     actual  predicted error prediction
     1        2:1        2:1       0.939
     2        2:1        2:1       0.939

I then took the number after the ":" in the predicted column as the predicted value for the row identified by inst#.

Here are my questions:

  1. Is this correct? I'm concerned about "?" as I read that it may be imputed (although that may be only during the learning phase).

  2. Does Weka support multiple predictions? No matter how many fields are marked with "?", I always get the same table with only one predicted value per instance.

  3. Can Weka generate a complete (predicted) ARFF file, or do I have to construct this myself from its results?

Apologies in advance if I missed something glaringly obvious; any pointers to relevant documentation would be greatly appreciated.

Thanks in advance!

1 Answer

The '?' is a generic marker for an unknown value. It can be used in training and test data and tells Weka that in this particular case, the value is not available. What is then done with that information depends on the actual learning algorithm. So to answer your questions:

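  1. No. The attribute to predict is not chosen by where the '?' appears; it is specified when training the model through the -c argument, which gives the 1-based index of the attribute to predict. By default, it's the last one, so 'F' in your case. The table you got therefore shows predictions for 'F', not for the attribute you marked with '?'.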
  2. No. This is actually more of an issue of the implemented learning algorithms, but none of those in Weka support this. The way to do it is to train one model per attribute you want to predict, each with a different -c value.
  3. This doesn't make sense in this case, because you have to supply the known values in order for Weka to be able to evaluate the accuracy of the classifier. If the values are completely unknown, there's no way of telling how good it is.
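
As a rough sketch of points 1 and 2, the loop below prints one J48 command per attribute, each selecting a different class attribute through -c. It only echoes the commands rather than running them. The file names are the ones from the question, and the helper name weka_cmd_for_class is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the J48 command that predicts attribute number $1.
# -c takes a 1-based attribute index; train.arff and test.arff are the files
# from the question.
weka_cmd_for_class() {
    echo "java weka.classifiers.trees.J48 -t train.arff -T test.arff -c $1 -p 0"
}

# One model per attribute to predict (A=1 ... F=6).
for idx in 1 2 3 4 5 6; do
    weka_cmd_for_class "$idx"
done
```

Piping the printed lines to sh would run them, assuming weka.jar is on your CLASSPATH. Each run then reports predictions for a single attribute only.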

Note that you can save a trained model and then use it to make predictions on new data. The documentation page on making predictions also contains the knowledge flow you can construct to save the results as an ARFF file.
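
A minimal sketch of that workflow on the command line, assuming the standard -d (save model) and -l (load model) classifier options and the prediction table layout shown in the question. The model file name j48.model, the output file predictions.txt, and the function name predicted_labels are placeholders. The Weka calls are left as comments; the function only post-processes the -p output into "inst#,label" pairs.

```shell
#!/bin/sh
# Train once and save the model (j48.model is a placeholder name):
#   java weka.classifiers.trees.J48 -t train.arff -c 2 -d j48.model
# Reuse the saved model to predict the class of each test instance:
#   java weka.classifiers.trees.J48 -l j48.model -T test.arff -c 2 -p 0 > predictions.txt

# Read a "Predictions on test data" table on stdin and print "inst#,label",
# where label is the text after ":" in the predicted column (third field).
# Assumes class labels do not themselves contain ":".
predicted_labels() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 3 { split($3, p, ":"); print $1 "," p[2] }'
}
```

For example, `predicted_labels < predictions.txt` gives one line per test instance, which you could merge back into the '?' positions of test.arff yourself if you prefer a script over the knowledge flow.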