7
votes

I'm reading a file which conatins 500000 rows. I'm testing to see how multiple thread speed up the process....

private void multiThreadRead(int num){

    for(int i=1; i<= num; i++) { 
        new Thread(readIndivColumn(i),""+i).start(); 
     } 
}

private Runnable readIndivColumn(final int colNum){
    return new Runnable(){
        @Override
        public void run() {
            // TODO Auto-generated method stub
            try {

                long startTime = System.currentTimeMillis();
                System.out.println("From Thread no:"+colNum+" Start time:"+startTime);

                RandomAccessFile raf = new RandomAccessFile("./src/test/test1.csv","r");
                String line = "";
                //System.out.println("From Thread no:"+colNum);

                while((line = raf.readLine()) != null){
                    //System.out.println(line);
                    //System.out.println(StatUtils.getCellValue(line, colNum));
                }


                long elapsedTime = System.currentTimeMillis() - startTime;

                String formattedTime = String.format("%d min, %d sec",  
                        TimeUnit.MILLISECONDS.toMinutes(elapsedTime), 
                        TimeUnit.MILLISECONDS.toSeconds(elapsedTime) -  
                        TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(elapsedTime)) 
                    );

                System.out.println("From Thread no:"+colNum+" Finished Time:"+formattedTime);
            } 
            catch (Exception e) {
                // TODO Auto-generated catch block
                System.out.println("From Thread no:"+colNum +"===>"+e.getMessage());

                e.printStackTrace();
            }
        }
    };
}

private void sequentialRead(int num){
    try{
        long startTime = System.currentTimeMillis();
        System.out.println("Start time:"+startTime);

        for(int i =0; i < num; i++){
            RandomAccessFile raf = new RandomAccessFile("./src/test/test1.csv","r");
            String line = "";

            while((line = raf.readLine()) != null){
                //System.out.println(line);
            }               
        }

        long elapsedTime = System.currentTimeMillis() - startTime;

        String formattedTime = String.format("%d min, %d sec",  
                TimeUnit.MILLISECONDS.toMinutes(elapsedTime), 
                TimeUnit.MILLISECONDS.toSeconds(elapsedTime) -  
                TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(elapsedTime)) 
            );

        System.out.println("Finished Time:"+formattedTime);
    }
    catch (Exception e) {
        e.printStackTrace();
        // TODO: handle exception
    }

}
    public TesterClass() {

    sequentialRead(1);      
    this.multiThreadRead(1);

}

for num = 1 I get following result:

Start time:1326224619049

Finished Time:2 min, 14 sec

Sequential read ENDS...........

Multi-Thread read starts:

From Thread no:1 Start time:1326224753606

From Thread no:1 Finished Time:2 min, 13 sec

Multi-Thread read ENDS.....

for num = 5 I get following result:

    formatted Time:10 min, 20 sec

Sequential read ENDS...........

Multi-Thread read starts:

From Thread no:1 Start time:1326223509574
From Thread no:3 Start time:1326223509574
From Thread no:4 Start time:1326223509574
From Thread no:5 Start time:1326223509574
From Thread no:2 Start time:1326223509574
From Thread no:4 formatted Time:5 min, 54 sec
From Thread no:2 formatted Time:6 min, 0 sec
From Thread no:3 formatted Time:6 min, 7 sec
From Thread no:5 formatted Time:6 min, 23 sec
From Thread no:1 formatted Time:6 min, 23 sec
Multi-Thread read ENDS.....

My question is: shouldn't multi-threaded read takes approx. 2.13 sec ? Can you please explain why is it taking too long with multi-threaded solution?

Thanks in advance.

3
Threading will not work unless they write to a different disk, in that case both threads are contending to write on to the same file. Hence, threading will not work in that scenario.user982733
@TomaszNurkiewicz - Not the same, that one was using one thread per file.Kelly S. French

3 Answers

10
votes

The reason you are seeing a slow down when reading in parallel is because the magnetic hard disk head needs to seek the next read position (taking about 5ms) for each thread. Thus, reading with multiple threads effectively bounces the disk between seeks, slowing it down. The only recommended way to read a file from a single disk is to read sequentially with one thread.

8
votes

Since file reading is mainly waiting for disk I/O, you have the problem that the disk won't spin faster just because it's used by many threads :)

1
votes

Reading from a file is an inherently serial process, assuming no caching, meaning there is a limit to how fast you can retrieve data from a file. Even without file locks (i.e. opening the file read-only) all the threads after the 1st will just block on the disk read so you make all the other threads wait and whichever one is active when the data becomes available is the one that processes the next block.