
I'm using gpars to process a 250M row MySQL database table in parallel. I create 8 gpars threads and 8 independent database connections, and divide the data in such a way that each thread operates independently on different ranges of rows... sort of a cheap MapReduce concept. At the core, the logic is like this:

  withExistingPool(pool)
  {
    connection_array.collectParallel()         
    {
       // Figure out which connection this thread can use. 
       // We use the index into the array to figure out 
       // which thread we are, and this tells us where to
       // read data.

       int i
       for (i = 0; i < connection_array.size(); i++)
          if (it == connection_array[i])
             break

       // Each thread runs the same query, with LIMIT controlling 
       // the position of rows it will read...if we have 8 threads
       // reading 40000 rows per call to this routine, each thread
       // reads 5000 rows (thread-0 reads rows 0-4999, thread-1 reads 
       // 5000-9999 and so forth). 

       def startrow = lastrow + (i * MAX_ROWS)
       def rows = it.rows( "SELECT * ... LIMIT ($startrow, $MAX_ROWS)")  

       // Add our rows to the result set we will return to the caller
       // (needs to be serialized since many threads can be here)

       lock.lock()
       if (!result) 
          result = rows
       else
          result += rows
       lock.unlock()
   }
 }

The code works great initially, giving me over 10,000 rows per second when it starts. But after a few million rows, it begins to slow down. By the time we're 25 million rows in, instead of 10,000 rows per second, we're only getting 1,000 rows per second. If we kill the app and restart it from the point we left off, it goes back to 10K rows per second again for a while, but it always slows down as processing continues.

There's plenty of processing power available - this is an 8-way system and the database is over a network, so there's a fair bit of wait time no matter what. The processors generally run no more than 25-30% busy while this is running. There also don't seem to be any memory leaks - we monitor memory statistics and don't see any change once processing is underway. The MySQL server doesn't seem to be stressed either (it runs about 30% busy initially, decreasing as the app slows down).

Are there any tricks to help this sort of thing run more consistently with large numbers of iterations?

It's probably because you're constantly resizing the result list. Have you tried specifying an initial size for result? You don't show how/where it's initialised. – tim_yates
You can also probably use (0..<connection_array.size()).collectParallel instead of your array, then you won't need to search for the current index. – tim_yates
@tim - thanks for the suggestions... your first comment is a good one - since we know how many rows are being fetched, we're able to preallocate the result and pass it in as a parameter rather than dynamically building it each time. This helped performance by almost 5%. The second suggestion actually proved to be slightly slower than our original approach - I guess it just doesn't take that long to search an 8-item array. And unfortunately, the original issue still exists... the routine gets slower and slower the more records it processes. – Valerie R
Actually, you're doing collect (which will be building the results) and collecting the results yourself... Change to results = connection_array.collectParallel() and get rid of your last if block... Or try keeping it and change to eachParallel. – tim_yates
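
Putting those two suggestions together would look roughly like the sketch below. It is only a sketch: pool, connection_array, lastrow and MAX_ROWS are the names from the question, and the query is abbreviated exactly as it is there.

    import static groovyx.gpars.GParsPool.withExistingPool

    def result
    withExistingPool(pool) {
        // collectParallel gathers each closure's return value itself, so no
        // shared list and no lock are needed. Iterating over the index range
        // also avoids the linear search for the current connection.
        result = (0..<connection_array.size()).collectParallel { i ->
            def startrow = lastrow + (i * MAX_ROWS)
            connection_array[i].rows("SELECT * ... LIMIT ${startrow}, ${MAX_ROWS}")
        }.flatten()    // one list of rows per thread, merged into a single flat list
    }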

2 Answers

0 votes

Okay, we think we found the issue - it seems to have to do with opening the JDBC connection on a different thread than the one where it's used. By initially opening the connection on the thread where it will be used, and then ensuring that ONLY that thread ever accesses that connection, the performance issue disappeared.
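
A minimal sketch of one way to arrange that with GPars and groovy.sql.Sql: the connection is opened, used and closed entirely inside the worker closure, so it never crosses threads. Here db_url, db_user, db_password, THREAD_COUNT and the driver class name are placeholders, not real configuration, and the query is abbreviated as in the question.

    import groovy.sql.Sql
    import static groovyx.gpars.GParsPool.withExistingPool

    withExistingPool(pool) {
        (0..<THREAD_COUNT).eachParallel { i ->
            // Opened on the worker thread; no other thread ever touches it.
            def sql = Sql.newInstance(db_url, db_user, db_password,
                                      'com.mysql.jdbc.Driver')
            try {
                def startrow = lastrow + (i * MAX_ROWS)
                def rows = sql.rows("SELECT * ... LIMIT ${startrow}, ${MAX_ROWS}")
                // ...process this thread's rows here...
            } finally {
                sql.close()
            }
        }
    }

(A long-lived connection kept in a ThreadLocal would give the same thread affinity without reopening a connection for every batch.)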

We also reworked the logic a bit to use a cursor-based approach rather than repeated queries with LIMIT. There were reports that LIMIT with a high start_row can be slow, but we didn't see a big difference from that change alone (cursors were faster, but performance still degraded as rows were processed).
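
For reference, a streaming ("cursor-style") read with groovy.sql.Sql might look like the sketch below. It assumes the MySQL Connector/J driver, where setting the fetch size to Integer.MIN_VALUE on a forward-only statement makes the driver stream rows instead of buffering the entire result set; the connection details are placeholders and the query is abbreviated as in the question.

    import groovy.sql.Sql

    def sql = Sql.newInstance(db_url, db_user, db_password, 'com.mysql.jdbc.Driver')

    // Ask Connector/J to stream the result set rather than load it all into memory.
    sql.withStatement { stmt -> stmt.fetchSize = Integer.MIN_VALUE }

    sql.eachRow("SELECT * ...") { row ->
        // process one row at a time; no OFFSET bookkeeping required
    }

    sql.close()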

Still, between this and some of the changes tim_yates suggested, we're running a good 30% faster than before - and now it's consistently fast no matter how many rows we process.

0 votes

LIMIT and OFFSET are not as efficient as most people would like.

When doing LIMIT 1000,20, 1000 rows will be read but skipped, then 20 rows will be read and delivered. That is, as the OFFSET grows, the query gets slower.

The technique to "fix" this is to "remember where you left off". This is especially easy with an AUTO_INCREMENT primary key, but it can be done with any PRIMARY KEY or UNIQUE key.
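For illustration, a "remember where you left off" loop in Groovy could look like the sketch below. The table name mytable, the id column, CHUNK and the connection details are hypothetical stand-ins, since the real schema isn't shown; the point is that each batch starts from the last key seen rather than from an ever-growing OFFSET.

    import groovy.sql.Sql

    // Hypothetical names: "id" stands in for the AUTO_INCREMENT primary key,
    // CHUNK for the batch size.
    def sql = Sql.newInstance(db_url, db_user, db_password, 'com.mysql.jdbc.Driver')
    def lastId = 0
    while (true) {
        def rows = sql.rows(
            "SELECT * FROM mytable WHERE id > ${lastId} ORDER BY id LIMIT ${CHUNK}")
        if (!rows) break
        // ...process this batch...
        lastId = rows[-1].id    // remember where we left off
    }
    sql.close()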

This is discussed further in my "Pagination" blog. It is aimed at "Next" buttons on a web page, so some of the discussion can be ignored.