5
votes

I'm analysing the k-means algorithm with Mahout. I'm going to run some tests, observe performance, and do some statistics with the results I get.

I can't figure out the way to run my own program within Mahout. However, the command-line interface might be enough.

To run the sample program I do

$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i reuters-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25

The dataset is one large CSV file. Each line is a record. Features are comma separated. The first field is an ID. Because of the input format I can not use seqdirectory right away. I'm trying to implement the answer to this similar question How to perform k-means clustering in mahout with vector data stored as CSV? but I still have 2 Questions:

  1. How do I convert from CSV to SeqFile? I guess I can write my own program using Mahout to make this conversion and then use its output as input for seq2parse. I guess I can use CSVIterator (https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations). What class should I use to read and write?
  2. How do I build and run my new program? I couldn't figure it out with the book Mahout in action or with other questions here.
5

5 Answers

5
votes

For getting your data in SequenceFile format, you have a couple of strategies you can take. Both involve writing your own code -- i.e., not strictly command-line.

Strategy 1 Use Mahout's CSVVectorIterator class. You pass it a java.io.Reader and it will read in your CSV file, turn each row into a DenseVector. I've never used this, but saw it in the API. Looks straight-forward enough if you're ok with DenseVectors.

Strategy 2 Write your own parser. This is really easy, since you just split each line on "," and you have an array you can loop through. For each array of values in each line, you instantiate a vector using something like this:

new DenseVector(<your array here>);

and add it to a List (for example).

Then ... once you have a List of Vectors, you can write them to SequenceFiles using something like this (I'm using NamedVectors in below code):

FileSystem fs = null;
SequenceFile.Writer writer;
Configuration conf = new Configuration();

List<NamedVector> vectors = <here's your List of vectors obtained from CSVVectorIterator>;

// Write the data to SequenceFile
try {
    fs = FileSystem.get(conf);

    Path path = new Path(<your path> + <your filename>);
    writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);

    VectorWritable vec = new VectorWritable();
    for (NamedVector vector : dataVector) {

        vec.set(vector);
        writer.append(new Text(vector.getName()), vec);

    }
    writer.close();

} catch (Exception e) {
    System.out.println("ERROR: "+e);
}

Now you have a directory of "points" in SequenceFile format that you can use for your K-means clustering. You can point the command line Mahout commands at this directory as input.

Anyway, that's the general idea. There are probably other approaches as well.

3
votes

To run kmeans with csv file, first you have to create a SequenceFile to pass as an argument in KmeansDriver. The following code reads each line of the CSV file "points.csv" and converts it into vector and write it to the SequenceFile "points.seq"

try (
            BufferedReader reader = new BufferedReader(new FileReader("testdata2/points.csv"));
            SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,new Path("testdata2/points.seq"), LongWritable.class, VectorWritable.class)
        ) {
            String line;
            long counter = 0;
            while ((line = reader.readLine()) != null) {
                String[] c = line.split(",");
                if(c.length>1){
                    double[] d = new double[c.length];
                    for (int i = 0; i < c.length; i++)
                            d[i] = Double.parseDouble(c[i]);
                    Vector vec = new RandomAccessSparseVector(c.length);
                    vec.assign(d);

                VectorWritable writable = new VectorWritable();
                writable.set(vec);
                writer.append(new LongWritable(counter++), writable);
            }
        }
        writer.close();
    }

Hope it helps!!

1
votes

There were a few issues when I was running the above code, so with a few modifications in the syntax here is the working code.

String inputfiledata = Input_file_path;
            String outputfile = output_path_for_sequence_file;
            FileSystem fs = null;
            SequenceFile.Writer writer;
            Configuration conf = new Configuration();
            fs = FileSystem.get(conf);
            Path path = new Path(outputfile);`enter code here`
            writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
            VectorWritable vec = new VectorWritable();
            List<NamedVector> vects = new ArrayList<NamedVector>();
            try {
                fr = new FileReader(inputfiledata);
                br = new BufferedReader(fr);
                s = null;
                while((s=br.readLine())!=null){

                    // My columns are split by tabs with each entry in a new line as rows
                    String spl[] = s.split("\\t");
                    String key = spl[0];
                    Integer val = 0;
                    for(int k=1;k<spl.length;k++){
                                colvalues[val] = Double.parseDouble(spl[k]);
                                val++;
                        }
                    }
                    NamedVector nmv = new NamedVector(new DenseVector(colvalues),key);
                    vec.set(nmv);
                    writer.append(new Text(nmv.getName()), vec);
                }
                            writer.close();

            } catch (Exception e) {
                System.out.println("ERROR: "+e);
            }
        }
0
votes

I would suggest you implement a program to convert the CSV to sparse vector sequence file that mahout accepts.
what you need to do is understand how InputDriver converts text files containing space-delimited floating point numbers into Mahout sequence files of VectorWritable suitable for input to the clustering jobs in particular, and any Mahout job requiring this input in general. You will customize the codes to your needs.
If you have downloaded the source code of Mahout, the InputDriver is at package org.apache.mahout.clustering.conversion.

0
votes

org.apache.mahout.clustering.conversion.InputDriver is a class that you can use to create sparse vectors.

Sample code is given below

mahout org.apache.mahout.clustering.conversion.InputDriver -i testdata -o output1/data -v org.apache.mahout.math.RandomAccessSparseVector

If you run mahout org.apache.mahout.clustering.conversion.InputDriver it will list out the parameters it expects.

Hope this helps.
Also here is an article I wrote to explain how I ran kmeans clustering on an arff file
http://mahout-hadoop.blogspot.com/2013/10/using-mahout-to-cluster-iris-data.html