Collect stream with grouping, counting and filtering operations

Question

I'm trying to collect a stream while discarding rarely used items, as in this example:

import java.util.*;
import java.util.function.Function;
import static java.util.stream.Collectors.*;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsInAnyOrder;
import org.junit.Test;

@Test
public void shouldFilterCommonlyUsedWords() {
    // given
    List<String> allWords = Arrays.asList(
       "call", "feel", "call", "very", "call", "very", "feel", "very", "any");

    // when
    Set<String> commonlyUsed = allWords.stream()
            .collect(groupingBy(Function.identity(), counting()))
            .entrySet().stream().filter(e -> e.getValue() > 2)
            .map(Map.Entry::getKey).collect(toSet());

    // then
    assertThat(commonlyUsed, containsInAnyOrder("call", "very"));
}

I have a feeling that this can be done much more simply. Am I right?

You can replace new ArrayList<>(Arrays.asList(…)) by a simple Arrays.asList(…). There is only one way to avoid a map which is calculating the frequency again for each item but that’s O(n²) CPU complexity, so I guess you better live with the intermediate map… — Holger
What this question attempts to do is equivalent to the SQL statement SELECT word FROM allWords GROUP BY word HAVING count(*) > 2. The groupingBy Collector does the job of GROUP BY, but there is no HAVING clause equivalent. It would be good for Java to add that functionality, e.g. something like groupingBy(Function<? super T,? extends K> classifier, Collector<? super T,A,D> downstream, Predicate<? super D> having). — rgettman
@rgettman: I don’t see the advantage of such a method. It’s only hiding the fact that it has to collect the entire map first, before filtering just like in my answer… — Holger
@Holger Then why have a HAVING clause in SQL? It's just hiding the fact that the group by and aggregate operations happen before the aggregate values are filtered. One could write SELECT word FROM (SELECT word, count(*) c FROM allWords GROUP BY word) WHERE c > 2;. Having HAVING allows more concise code. One can certainly use your solution; it will work. I just pointed out that it would be good for Java to have the HAVING option that would simplify the code. — rgettman

Holger · Accepted Answer · 2015-05-28T20:05:51

There is no way around creating a Map, unless you want to accept a very high CPU complexity.
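
For comparison, here is a minimal sketch of what the map-free variant mentioned in the comments could look like. It recounts the frequency for every element via Collections.frequency, which is where the O(n²) cost comes from. The class name QuadraticFrequencyFilter and the threshold parameter are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class QuadraticFrequencyFilter {

    // Keeps the words that occur more than `threshold` times without building a Map.
    // Collections.frequency scans the whole list for each element, hence O(n²) overall.
    public static Set<String> commonlyUsed(List<String> allWords, int threshold) {
        return allWords.stream()
                .filter(w -> Collections.frequency(allWords, w) > threshold)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<String> allWords = Arrays.asList(
                "call", "feel", "call", "very", "call", "very", "feel", "very", "any");
        // The resulting set contains "call" and "very" (iteration order is unspecified).
        System.out.println(commonlyUsed(allWords, 2));
    }
}
```

This is fine for a handful of words, but the repeated scans quickly dominate on larger inputs, which is why the intermediate map is usually the better trade-off.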

However, you can remove the second collect operation:

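// Count occurrences per word; HashMap::new guarantees a mutable HashMap
Map<String, Long> map = allWords.stream()
    .collect(groupingBy(Function.identity(), HashMap::new, counting()));
// Remove every entry seen at most twice; removal via values() writes through to the map
map.values().removeIf(l -> l <= 2);
Set<String> commonlyUsed = map.keySet();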

Note that in Java 8, HashSet still wraps a HashMap, so using the keySet() of a HashMap, when you want a Set in the first place, doesn’t waste space given the current implementation.
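
One caveat worth keeping in mind: the set returned by keySet() is a view backed by the map, and according to the HashMap documentation it does not support add or addAll. The sketch below illustrates this and shows copying into a HashSet when an independent, fully mutable set is needed. The class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KeySetViewDemo {

    // Returns true if the given set refuses add(), as a HashMap key-set view does.
    public static boolean addIsRejected(Set<String> set) {
        try {
            set.add("new");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Long> map = new HashMap<>();
        map.put("call", 3L);
        map.put("very", 3L);

        Set<String> view = map.keySet();          // backed by the map, no extra storage
        Set<String> copy = new HashSet<>(view);   // independent copy, supports add()

        System.out.println(addIsRejected(view));  // prints "true"
        System.out.println(addIsRejected(copy));  // prints "false"
    }
}
```

If the result is only read or iterated, the view is sufficient and the copy can be skipped.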

Of course, you can hide the post-processing in a Collector if that feels more “streamy”:

Set<String> commonlyUsed = allWords.stream()
    .collect(collectingAndThen(
        groupingBy(Function.identity(), HashMap::new, counting()),
        map-> { map.values().removeIf(l -> l<=2); return map.keySet(); }));
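
If this pattern shows up repeatedly, it could be wrapped in a reusable helper along the lines of the HAVING-style overload suggested in the comments. The following is only a sketch: groupingByHaving is a hypothetical method, not part of the JDK, and it still builds the complete map before filtering.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collector;
import java.util.stream.Collectors;

public class HavingCollectors {

    // Hypothetical helper, not part of the JDK: groups the elements, reduces each
    // group with `downstream`, then removes groups whose result fails `having`.
    public static <T, K, A, D> Collector<T, ?, Map<K, D>> groupingByHaving(
            Function<? super T, ? extends K> classifier,
            Collector<? super T, A, D> downstream,
            Predicate<? super D> having) {
        // HashMap::new ensures the intermediate map is mutable.
        Collector<T, ?, Map<K, D>> grouping =
                Collectors.groupingBy(classifier, HashMap::new, downstream);
        return Collectors.collectingAndThen(grouping, map -> {
            // The full map is still collected first; filtering happens afterwards.
            map.values().removeIf(having.negate());
            return map;
        });
    }

    public static void main(String[] args) {
        List<String> allWords = Arrays.asList(
                "call", "feel", "call", "very", "call", "very", "feel", "very", "any");
        Set<String> commonlyUsed = allWords.stream()
                .collect(groupingByHaving(Function.identity(), Collectors.counting(), n -> n > 2))
                .keySet();
        // The resulting set contains "call" and "very" (iteration order is unspecified).
        System.out.println(commonlyUsed);
    }
}
```

The helper returns the filtered map rather than just the keys, so callers can still read the counts; taking keySet() afterwards gives the same result as the collector above.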
