1
votes

Goal

I have a mutable Map[Long, Long] with millions of entries. I need to make many iterations of updates with millions of updates. I would like to do this as fast as possible.

Background

Currently, the fastest method is to use a single threaded mutable.LongMap[Long]. This type is optimized for Long types as the key.

Other map types appear to be slower -- but I may have implemented them incorrectly as I was trying to do the updates concurrently and/or in parallel without success. It is possible that updating a map in parallel is not actually occurring or is not possible in Scala.

In order of fastest to slowest:

  1. LongMap[Long] (from above)
  2. TrieMap[Long, Long]
  3. ParTrieMap[Long, Long]
  4. HashMap[Long, Long]
  5. ParHashMap[Long, Long]
  6. ParMap[Long, Long]

It is OK if a faster method is not mutable, but I do not think this will be the case. A mutable map is probably best for this use case.

Code to generate test data and time the test

import java.util.Calendar
import scala.collection.mutable

object DictSpeedTest2 {

  //helper constants
  val million: Long = 1000000
  val billion: Long = million * 1000

  //config
  val garbageCollectionWait = 3
  val numEntries: Long = million * 10 //may need to increase JVM memory with something like: -Xmx32g
  val maxValue: Long = billion * million // max Long = 9223372036854775807L
                                         // this is       1000000000000000L

  def main(args: Array[String]): Unit = {
    //generate random data; initial entries in a; updates in b
    val a = genData(numEntries, maxValue, seed = 1000)
    val b = genData(numEntries, maxValue, seed = 9999)

    //initialization
    val dict = new mutable.LongMap[Long]()
    a.foreach(x => dict += (x._1 -> x._2))

    //run and time test
    println("start test: " + Calendar.getInstance().getTime)
    val start = System.currentTimeMillis
    b.foreach(x => dict += (x._1 -> x._2)) //updates
    val end = System.currentTimeMillis

    //print runtime
    val durationInSeconds = (end - start).toFloat / 1000 + "s"
    println("end test:  " + Calendar.getInstance().getTime + " -- " + durationInSeconds)
  }

  def genData(n: Long, max: Long, seed: Long): Array[(Long, Long)] = {
    val r = scala.util.Random
    r.setSeed(seed) //deterministic generation of arrays
    val a = new Array[(Long, Long)](n.toInt)
    a.map(_ => (r.nextInt(), r.nextInt()) )
  }
}

Current timings

LongMap[Long] with the above code completes in the following times on my 2018 MacBook Pro:

  • ~3.5 seconds with numEntries = 10 million
  • ~100 seconds with numEntries = 100 million
1
First this is far for being an appropriate way to test the performance of two implementations. Take a look to proper benchmark tools like ScalaMeter. Second, mutability + concurrency / parallelism will lead to many headaches. It may help describing what are you trying to model here and why performance is too important for you in this use case.Luis Miguel Mejía Suárez
Still you haven't answered what is the purpose of this mutable map and why performance is too important.Luis Miguel Mejía Suárez
Have you considered moving that state outside your app. For example to a database, for example Mongo? Or even better to a distributed cache like Redis? Or maybe using other technologies to process that amount of data, like Spark or Scio?Luis Miguel Mejía Suárez
Yeah, I was referring more to the ability of keeping all that mutations outside your code, sometimes you can optimize the algorithm in different ways. Scio would not work for you if you use AWS as it integrates well with Dataflow service in GCP. PS: You may also see if streaming would help, take a look to fs2, Monix or Akka Streams.Luis Miguel Mejía Suárez
I am very familiar with Akka Actors and have dabbled with the streams library. I really want to keep the state in this module. Going outside of the app would slow things down and require pulling some state back in. Latency matters here. I imported FastUtil and received an almost instantaneous 8x speed increase. That will be good enough for now.AetherMass

1 Answers

2
votes

If you are not limited to use only Scala/Java maps than for exceptional performance you can peek 3rd party libraries that have maps specialized for Long/Long key/value pairs.

Here is not so outdated overview of such kind of libraries with benchmark results for Int/Int pairs.