Word Frequency of each word in 2GB txt file in UTF-8 Encoding in Java

Question

I am working on a project in which I need to find the frequency of each word in a large corpus of over 100 million Bengali words. The file size is around 2 GB. What I actually need are the 20 most frequent and the 20 least frequent words, with their frequency counts. I wrote the same code in PHP, but it is taking far too long (it is still running after a week), so I am now trying to do this in Java.

The code is supposed to work as follows:

- read a line from the corpus nahidd_filtered.txt

- split it on whitespace

- for each split word, read the whole frequency file freq3.txt
- if the word is found, increase its frequency count and store it in that file
- otherwise set count = 1 (new word) and store the frequency count in that file

I have tried to read the text from the nahidd_filtered.txt corpus in a loop, storing each word with its frequency in freq3.txt. The freq3.txt file stores the frequency counts like this:

Word1 Frequency1 (single whitespace in between)

Word2 Frequency2

...........

Simply put, I need the 20 most frequent and the 20 least frequent words, along with their frequency counts, from the large UTF-8 encoded corpus file. Please check the code and tell me why it is not working, or offer any other suggestions. Thank you very much.

import java.io.*;
import java.util.*;
import java.util.concurrent.TimeUnit;

public class Main {


private static String fileToString(String filename) throws IOException {
    FileInputStream inputStream = null;
    Scanner reader = null;
    inputStream = new FileInputStream(filename);
    reader = new Scanner(inputStream, "UTF-8");

    /*BufferedReader reader = new BufferedReader(new FileReader(filename));*/
    StringBuilder builder = new StringBuilder();


    // For every line in the file, append it to the string builder
    while (reader.hasNextLine()) {
        String line = reader.nextLine();
        builder.append(line);
    }

    reader.close();
    return builder.toString();
}

public static final String UTF8_BOM = "\uFEFF";

private static String removeUTF8BOM(String s) {
    if (s.startsWith(UTF8_BOM)) {
        s = s.substring(1);
    }
    return s;
}

public static void main(String[] args) throws IOException {

    long startTime = System.nanoTime();
    System.out.println("-------------- Start Contents of file: ---------------------");
    FileInputStream inputStream = null;
    Scanner sc = null;
    String path = "C:/xampp/htdocs/thesis_freqeuncy_2/nahidd_filtered.txt";
    try {
        inputStream = new FileInputStream(path);
        sc = new Scanner(inputStream, "UTF-8");
        int countWord = 0;
        BufferedWriter writer = null;
        while (sc.hasNextLine()) {
            String word = null;
            String line = sc.nextLine();
            String[] wordList = line.split("\\s+");

            for (int i = 0; i < wordList.length; i++) {
                word = wordList[i].replace("।", "");
                word = word.replace(",", "").trim();
                ArrayList<String> freqword = new ArrayList<>();
                String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
                /*freqword = freq.split("\\r?\\n");*/
                Collections.addAll(freqword, freq.split("\\r?\\n"));
                int flag = 0;
                String[] freqwordsp = null;
                int k;
                for (k = 0; k < freqword.size(); k++) {
                    freqwordsp = freqword.get(k).split("\\s+");
                    String word2 = freqwordsp[0];
                    word = removeUTF8BOM(word);
                    word2 = removeUTF8BOM(word2);
                    word.replaceAll("\\P{Print}", "");
                    word2.replaceAll("\\P{Print}", "");
                    if (word2.toString().equals(word.toString())) {

                        flag = 1;
                        break;
                    }
                }

                int count = 0;
                if (flag == 1) {
                    count = Integer.parseInt(freqwordsp[1]);
                }
                count = count + 1;
                word = word + " " + count + "\n";
                freqword.add(word);

                System.out.println(freqword);
                writer = new BufferedWriter(new FileWriter("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt"));
                writer.write(String.valueOf(freqword));
            }
        }
        // writer.close();
        System.out.println(countWord);
        System.out.println("-------------- End Contents of file: ---------------------");
        long endTime = System.nanoTime();
        long totalTime = (endTime - startTime);
        System.out.println(TimeUnit.MINUTES.convert(totalTime, TimeUnit.NANOSECONDS));

        // note that Scanner suppresses exceptions
        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
        if (sc != null) {
            sc.close();
        }
    }

}

}

What is not working? What did you expect to happen and what happened instead? — Erwin Bolwidt
I expected freq3.txt to contain all unique words and their frequencies, with a space in between and each word on a new line. But the entire file is empty. — Rahat Ahmed

ilinykhma · Accepted Answer · 2019-03-01T06:25:23

First of all:

for each split word, read whole frequency file freq3.txt

Don't do it! Disk I/O operations are very slow. Do you have enough memory to read the file into memory? It seems so:

String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
Collections.addAll(freqword, freq.split("\\r?\\n"));

If you really need this file, load it once and work in memory. In that case a Map (word to frequency) may also be more convenient than a List. Save the collection to disk when the calculations are done.
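
The "count in memory, save once" idea can be sketched roughly as follows. This is only an illustration: the class WordFrequencyCounter and its method names are hypothetical, the clean-up mirrors the danda and comma removal from the question, and the "word count" output format is the one the question describes for freq3.txt.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original question or answer.
final class WordFrequencyCounter {

    // Reads the text line by line and counts every whitespace-separated token in memory.
    static Map<String, Long> countWords(BufferedReader reader) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            for (String token : line.split("\\s+")) {
                // Same clean-up as in the question: drop the danda (U+0964) and commas.
                String word = token.replace("\u0964", "").replace(",", "");
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Writes one "word count" line per entry, once, after all counting is finished.
    static void writeFrequencies(Map<String, Long> counts, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Long> entry : counts.entrySet()) {
                writer.write(entry.getKey() + " " + entry.getValue());
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny in-memory demo. For the real corpus, the reader would instead come from
        // Files.newBufferedReader(corpusPath, StandardCharsets.UTF_8).
        try (BufferedReader reader = new BufferedReader(new StringReader("one two, one\nthree one"))) {
            Map<String, Long> counts = countWords(reader);
            Path out = Files.createTempFile("freq", ".txt");
            writeFrequencies(counts, out);
            System.out.println(counts.size() + " distinct words written to " + out);
        }
    }
}
```

With this shape the corpus is read exactly once and the frequency file is written exactly once, so the cost grows with the corpus size rather than with corpus size times the size of freq3.txt. Whether the map fits in memory depends on the number of distinct words, not on the 2 GB file size.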

Next, you could buffer your input stream, which may significantly improve performance:

inputStream = new BufferedInputStream(new FileInputStream(path));

And don't forget to close the stream/reader/writer, either explicitly or with the try-with-resources statement.
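
Closing matters for the output file in particular. A BufferedWriter only passes its buffered text on to the file when it is flushed or closed, and in the question's code writer.close() is commented out, which is a plausible reason why freq3.txt stays empty. A minimal sketch of the try-with-resources form, assuming a hypothetical SafeWriterDemo class and a temporary file as the target:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the original question or answer.
final class SafeWriterDemo {

    // Writes all lines as UTF-8; try-with-resources closes (and thereby flushes) the writer.
    static void writeLines(Path target, List<String> lines) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        } // writer.close() runs here, even if an exception was thrown inside the block
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("freq-demo", ".txt");
        writeLines(target, Arrays.asList("word1 3", "word2 1"));
        // Read the file back to confirm the content actually reached the disk.
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
    }
}
```

The same pattern applies to readers and input streams: declare them in the parentheses after try and they are closed in reverse order when the block ends.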

Generally speaking, the code can be simplified, depending on the API used. For example:

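import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DemoApplication {

    public static final String UTF8_BOM = "\uFEFF";

    // Strips a leading byte order mark, which would otherwise stick to the first word.
    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }

    private static final String PATH = "words.txt";

    // Split on any run of whitespace, as the question's code does.
    private static final String REGEX = "\\s+";

    public static void main(String[] args) throws IOException {

        Map<String, Long> frequencyMap;
        // Read explicitly as UTF-8; FileReader would use the platform default charset.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(PATH), StandardCharsets.UTF_8)) {
            frequencyMap = reader
                    .lines()
                    .flatMap(s -> Arrays.stream(s.split(REGEX)))
                    .map(DemoApplication::removeUTF8BOM)
                    .filter(s -> !s.isEmpty()) // skip empty tokens from leading whitespace
                    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        }

        // Ascending order by count, so this prints the 20 least frequent words.
        frequencyMap
                .entrySet()
                .stream()
                .sorted(Comparator.comparingLong(Map.Entry::getValue))
                .limit(20)
                .forEach(System.out::println);
    }
}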
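
The question asks for both ends of the ranking, while the sort above is ascending and so yields the least frequent words. One possible way to get either end from a map like frequencyMap is sketched below. The class FrequencyRanking and its pick method are hypothetical names, and breaking ties alphabetically is an assumption made here only to keep the output deterministic.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original answer.
final class FrequencyRanking {

    // Returns up to n entries ordered by count (descending if mostFrequent, else ascending).
    // Entries with equal counts are ordered by word so the result is deterministic.
    static List<Map.Entry<String, Long>> pick(Map<String, Long> counts, int n, boolean mostFrequent) {
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.<String, Long>comparingByValue();
        if (mostFrequent) {
            byCount = byCount.reversed();
        }
        return counts.entrySet().stream()
                .sorted(byCount.thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Small made-up map standing in for the real frequency map.
        Map<String, Long> counts = new HashMap<>();
        counts.put("alpha", 5L);
        counts.put("beta", 1L);
        counts.put("gamma", 3L);

        System.out.println("Most frequent:  " + pick(counts, 20, true));
        System.out.println("Least frequent: " + pick(counts, 20, false));
    }
}
```

Sorting the whole entry set twice is simple and should be acceptable as long as the number of distinct words is moderate. For a very large vocabulary, a bounded priority queue of size 20 would avoid the full sort.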
