9 votes

In this SO thread, I learned that keeping a reference to a seq on a large collection will prevent the entire collection from being garbage-collected.

First, that thread is from 2009. Is this still true in "modern" Clojure (v1.4.0 or v1.5.0)?

Second, does this issue also apply to lazy sequences? For example, would (def s (drop 999 (seq (range 1000)))) allow the garbage collector to retire the first 999 elements of the sequence?

Lastly, is there a good way around this issue for large collections? In other words, if I had a vector of, say, 10 million elements, could I consume the vector in such a way that the consumed parts could be garbage collected? What about if I had a hashmap with 10 million elements?

The reason I ask is that I'm operating on fairly large data sets, and I am having to be more careful not to retain references to objects, so that the objects I don't need can be garbage collected. As it is, I'm encountering a java.lang.OutOfMemoryError: GC overhead limit exceeded error in some cases.

2
I think @cgrand's example (drop 999990 (vec (range 1000000))) is due to the intervening vector and the structure-sharing behavior of subvectors. I don't suspect a lazy consed sequence would do this. If you need to release a vector while retaining a subvector, you can copy the subvector into a new vector, as in the sketch below. Very interesting question though, I'm waiting to see the answers too! - A. Webb
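
For example, a minimal sketch of the copy trick described above (the values are illustrative):

    (let [v (vec (range 1000000))]
      ;; subvec shares structure with v, so returning (subvec v 999990)
      ;; would keep all 1,000,000 elements reachable.
      ;; Copying into a fresh vector breaks that link, letting v be
      ;; collected once nothing else references it:
      (into [] (subvec v 999990)))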

2 Answers

7 votes

It is always the case that if you "hold onto the head" of a sequence then Clojure will be forced to keep everything in memory. It doesn't have a choice: you are still keeping a reference to it.
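
For illustration, a minimal sketch of the difference at a REPL (the count needed to exhaust memory depends on your heap size):

    ;; Holds the head: the var s pins the first cell, so every realized
    ;; element stays reachable and the heap eventually fills up.
    (def s (range 100000000))
    (last s)

    ;; Does not hold the head: elements become unreachable as soon as
    ;; last walks past them, so this runs in roughly constant memory.
    (last (range 100000000))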

However, "GC overhead limit exceeded" isn't the same as a plain out-of-memory error: it's more likely a sign that you are running a synthetic workload that creates and discards objects so fast that it tricks the GC into thinking it is overloaded.

If you put an actual workload on the items being processed, I suspect you will see that this error won't happen any more. You can easily process lazy sequences that are larger than available memory in this case.
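
For example, something like this traverses a sequence far larger than a typical heap yet stays within memory, because each element is dropped as soon as the reduction moves past it:

    ;; Sum the square roots of 100 million numbers without ever
    ;; holding onto the head of the sequence.
    (reduce + (map #(Math/sqrt %) (range 100000000)))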

Concrete collections like vectors and hashmaps are a different matter, however: these are not lazy, so they must always be held completely in memory. If you have datasets larger than memory then your options include:

  • Use lazy sequences and don't hold onto the head
  • Use specialised collections that support lazy loading (Datomic uses some structures like this I believe)
  • Treat the data as an event stream (using something like Storm)
  • Write custom code to partition the data into chunks and process them one at a time (see the sketch below).
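
As a minimal sketch of that last option, assume hypothetical functions load-chunk (fetches one chunk from external storage) and process-record, and a known num-chunks:

    ;; Process the data one chunk at a time; each chunk becomes garbage
    ;; as soon as doseq moves on to the next one.
    (doseq [i (range num-chunks)]
      (let [chunk (load-chunk i)]        ; hypothetical loader
        (doseq [record chunk]
          (process-record record))))     ; hypothetical per-record work
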
0 votes

If you hold the head of a sequence in a binding then you are correct: it can't be GC'd (and that's true for every version of Clojure). If you are processing a large number of results, why do you need to hold onto the head?

As for a way around it, yes! A lazy-seq implementation can GC parts that have already been processed and are no longer directly referenced from a binding. Just ensure that you are not holding onto the head of the sequence; a sketch follows.
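
A minimal sketch, assuming hypothetical functions fetch-results (returns a lazy sequence) and handle-result:

    ;; Retains the head: the var results pins the whole sequence,
    ;; so nothing can be collected until results is unmapped.
    (def results (fetch-results))
    (doseq [r results] (handle-result r))

    ;; Does not retain the head: each result can be collected as soon
    ;; as doseq moves past it.
    (doseq [r (fetch-results)] (handle-result r))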