While using the Carrot2 web app to cluster my own data from a Lucene index, I found that the results are not what I expected.

Error one: the results list on the right shows only the file names in each cluster, without the matching text passages or file locations. I'm not sure what causes this. My guess is that either the format of the index I created with Lucene is wrong, or there is a problem with my Carrot2 web app configuration. I hope someone can tell me the answer. (Sorry, I can't post my picture for this one; you can look at the picture for error two.)

Error two: my search results show only "other topics" and no specific topics, which bothers me. I think the cause might be the clustering algorithm, or that the test data I provided is too small.

When I use the K-means clustering algorithm, I get a lot of topics, but they have no specific topic names, only file names.

If someone can clear up my doubts, I would appreciate it very much.

This is my code for creating the Lucene index files:

package test2;

import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;
import org.carrot2.source.lucene.SimpleFieldMapper;

import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.FileReader;


//lucene 4.9
public class LuceneDemo2 {
    public static void main(String[] args) throws Exception {
        String indexDir = "D:\\data\\lucene\\odp\\index-all";
        String dataDir = "D:\\data";

        long start = System.currentTimeMillis();
        LuceneDemo2 indexer = new LuceneDemo2(indexDir);

        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir,new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();

        System.out.println("Indexing " + numIndexed + " files took " + (end-start) + " milliseconds.");
    }

    private IndexWriter writer;

    public LuceneDemo2(String indexDir) throws IOException {
        // TODO Auto-generated constructor stub
        Directory directory = FSDirectory.open(new File(indexDir));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,analyzer);
        config.setOpenMode(OpenMode.CREATE);
        writer = new IndexWriter(directory,config);
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index (String dataDir,FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();

        // listFiles() returns null when dataDir is not a directory or cannot be read
        if (files == null) return writer.numDocs();
        for(File f: files) {
            if(!f.isDirectory()&&
                !f.isHidden()&&
                f.exists()&&
                f.canRead()&&
                (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }

        /*
        if(files == null) return writer.numDocs();
        for(int i=0;i<files.length&&files!=null;i++) {
            if(!files[i].isDirectory()&&
                !files[i].isHidden()&&
                files[i].exists()&&
                files[i].canRead()&&
                (filter == null || filter.accept(files[i]))) {
                indexFile(files[i]);
            }
        }
        */
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }   
    }

    private Document getDocument(File f) throws Exception {
        // TODO Auto-generated method stub
        Document document = new Document();
        document.add(new StringField("path",  f.getAbsolutePath(), Field.Store.YES));
        document.add(new LongField("modified", f.lastModified(), Field.Store.NO)); 
        document.add(new TextField("content", new FileReader(f)));
        document.add(new TextField("title", f.getName(), Field.Store.YES));

        return document;
    }

    private void indexFile(File f) throws Exception {
        // TODO Auto-generated method stub
        System.out.println("Indexing "+ f.getCanonicalPath());
        Document document = getDocument(f);
        writer.addDocument(document);
    }   
}

This is my code for indexing PDF files (part of it):

// Needs imports: java.io.BufferedReader, java.io.InputStreamReader,
// org.apache.lucene.document.Field.Store
private void indexFile(File f) throws Exception {
    System.out.println("Indexing " + f.getCanonicalPath());
    //Document d = LucenePDFDocument.getDocument(f);
    // Convert the PDF to text with pdftotext; the trailing "-" writes the text to stdout
    String executeStr = "D:\\xpdf\\xpdfbin-win-3.04\\bin64\\pdftotext.exe";
    String[] cmd = new String[]{executeStr, "-enc", "UTF-8", "-q", f.getAbsolutePath(), "-"};
    StringBuilder sb = new StringBuilder();
    BufferedReader br = null;
    try {
        Process p = Runtime.getRuntime().exec(cmd);
        br = new BufferedReader(new InputStreamReader(p.getInputStream(), "UTF-8"));
        String str;
        // Collect the converted text line by line
        while ((str = br.readLine()) != null) {
            sb.append(str).append("\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (br != null) {
            try {
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    String content = sb.toString();
    // All three fields are stored, so their raw text can be read back from the index
    Document document = new Document();
    document.add(new StringField("url", f.getAbsolutePath(), Store.YES));
    document.add(new TextField("content", content, Store.YES));
    document.add(new TextField("title", f.getName(), Store.YES));
    writer.addDocument(document);
}

Today I fixed error one by changing document.add(new StringField("path", f.getAbsolutePath(), Field.Store.YES)); to document.add(new StringField("url", f.getAbsolutePath(), Field.Store.YES)); - 魏世康

1 Answer

Carrot2 algorithms operate on the raw text of the documents, so all the content fields you'd like to cluster need to be stored (Field.Store.YES). The Reader-based TextField constructor used in your first listing indexes the "content" field but does not store it. To have your "content" field stored in the index, the easiest solution would be to read the contents of the corresponding file into a String and then use the String-based constructor of the TextField class.

Once you re-index your content and set Carrot2 to cluster based on your "title" and "content" fields, you should see some meaningful clusters.
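
As a rough illustration of the "read the file into a String" step, here is a minimal sketch that uses only the JDK. The class name StoredContentSketch and the helper readContent are hypothetical names chosen for this example, and it assumes the text files are UTF-8 and small enough to load into memory. The Lucene call is shown as a comment so the sketch compiles without Lucene on the classpath; it uses the String-based TextField constructor mentioned above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class StoredContentSketch {

    // Hypothetical helper: reads the whole file into memory as UTF-8 text.
    // Assumption: files are UTF-8 encoded and small enough to fit in memory.
    public static String readContent(File f) throws IOException {
        return new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java StoredContentSketch <file.txt>");
            return;
        }
        String content = readContent(new File(args[0]));

        // In getDocument(File f) from the question, the Reader-based field
        //   document.add(new TextField("content", new FileReader(f)));
        // would then be replaced by the String-based, stored variant:
        //   document.add(new TextField("content", content, Field.Store.YES));
        System.out.println("Read " + content.length()
                + " characters for the \"content\" field");
    }
}
```

After this change the index has to be rebuilt, since documents that were already indexed keep their unstored "content" field.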