I want to build an inverted index in java. I have cran data of 1400 text files. I was able to count the frequency of each term/word. I have been able to return the number times a word appears in the entire collection, but I have not been able to return which documents the word appears in. This is the code I have so far:
I want the output in the following form term1: doc1:2, doc2:3 term2: doc1:3, doc4:1 ............... so on
here term is a word in a doc file and doc 1:2 means term1 appears in doc 1 2 times
public static void main(String[]args) throws FileNotFoundException{
Map<String, Integer> m = new HashMap<>();
String wrd;
for(int i=1;i<=2;i++){
//FileInputStream tdfr = new FileInputStream("D:\\logs\\steem"+i+".txt");
Scanner tdsc=new Scanner(new File("D:\\logs\\steem"+i+".txt"));
while(tdsc.hasNext()){
// m.clear();
Integer docid=i;
wrd=tdsc.next();
//Vector<Integer> vPosList = p.hPosList.get(wrd);
Integer freq=m.get(wrd);
//Integer doc=m1.get(i);
//System.out.println(m.get(wrd));
m.put(wrd, (freq == null) ? 1 : freq + 1);
}
System.out.println(m.size() + " distinct words" + " steem" +i);
System.out.println("Doc" +i+""+m);
//System.out.println("Doc"+i+""+m1);
m.clear();
tdsc.close();
}
//System.out.println(m.size() + " distinct words");
//System.out.println(m);
// System.out.println(m1);
}
}