2
votes

I am working with a data set that has more than 10,000 dimensions. To use Weka I need to convert a text file into ARFF format, but because there are so many attributes, the file is too large even in sparse ARFF format. Is there a method, similar to the sparse format for the data section, that avoids writing so many attribute declarations in the ARFF header?

For example:
@attribute A1 NUMERIC
@attribute A2 NUMERIC
...
...
@attribute A10000 NUMERIC

1
Maximum number of attributes supported by WEKA. Is this a problem with creating the ARFF file (in which case sed or awk might help) or with handling it directly in Weka? – chl
@chl Thanks for the response. I am able to generate the ARFF file, but it is very large because I have 184,000 attributes. I was wondering if there is any way to avoid adding so many header lines to the ARFF file. All attributes are numeric, so I thought there might be a way. – Deepak

1 Answer

0
votes

I wrote an AWK script that converts lines like the following (in a TXT file) to ARFF.

example.txt source:

Att_0 | Att_1 | Att_2 | ... | Att_n
1 | 2 | 3 | ... | 999

My script (to_arff) is below. Change the FS value to match the separator used in your TXT file:

#!/usr/bin/awk -f
# Usage: ./to_arff data.txt > data.arff

BEGIN {
    # Input separator; the optional spaces are part of the pattern so that
    # "Att_0 | Att_1" yields "Att_0" and "Att_1" without padding
    FS = " *[|] *";
    # WEKA separator
    separator = ",";
}

# The first line holds the attribute names
NR == 1 {
    # WEKA headers
    # the relation's name is the source file's name (up to the first ".")
    split(FILENAME, relation, ".");
    print "@RELATION "relation[1]"\n";
    # attributes are "numeric" by default
    # types available: numeric, <nominal> {n1, n2, ..., nN}, string and date [<date-format>]
    for (i = 1; i <= NF; i++) {
        print "@ATTRIBUTE "$i" NUMERIC";
    }
    print "\n@DATA";
}

# Every other line is a data row: join the fields with the WEKA separator
NR > 1 {
    s = "";
    first = 1;
    for (i = 1; i <= NF; i++) {
        if (first)
            first = 0;
        else
            s = s separator;
        s = s $i;
    }
    print s;
}

Output:

@RELATION example

@ATTRIBUTE Att_0 NUMERIC
@ATTRIBUTE Att_1 NUMERIC
@ATTRIBUTE Att_2 NUMERIC
...
@ATTRIBUTE Att_n NUMERIC

@DATA
1,2,3,...,999
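
On the file-size question: as far as I know, ARFF has no shorthand for declaring a range of attributes, so the header needs one @ATTRIBUTE line per attribute either way. If most of the values are zero, though, the data section can be written in sparse ARFF form, where each row lists only its non-zero values as "index value" pairs with 0-based attribute indices. Below is a minimal sketch of that variant, not a tested replacement for the script above. The function name to_sparse_arff is made up for this example, and it assumes the same "|"-separated input with attribute names on the first line.

```sh
#!/bin/sh
# to_sparse_arff: hypothetical helper name, not part of WEKA.
# Usage: to_sparse_arff data.txt > data.arff
to_sparse_arff() {
    awk '
    BEGIN {
        # Assumed input separator: "|" with optional surrounding spaces
        FS = " *[|] *"
    }

    # First line: attribute names become the ARFF header
    NR == 1 {
        split(FILENAME, relation, ".")
        print "@RELATION " relation[1] "\n"
        for (i = 1; i <= NF; i++)
            print "@ATTRIBUTE " $i " NUMERIC"
        print "\n@DATA"
        next
    }

    # Data lines: keep only non-zero values, as "<0-based index> <value>"
    {
        row = ""
        for (i = 1; i <= NF; i++) {
            if ($i + 0 != 0) {
                if (row != "")
                    row = row ", "
                row = row (i - 1) " " $i
            }
        }
        print "{" row "}"
    }
    ' "$@"
}
```

With this, a data line such as "0 | 5 | 0 | 7" is written as "{1 5, 3 7}". How much this saves depends entirely on how many values are zero; for dense data the sparse form is larger than the comma-separated one.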