NOT IN Subquery slow and runs out of memory (Clickhouse)

Question

I have a single table holding DNA variants for different people. I want to show the variants that are unique to a person:

Table DNA (engine ordered by variant):

   person | variant
   p1     | v1 
   p1     | v2 
   p1     | v3 
   p2     | v2 
   p2     | v3 
   p3     | v2 
   p3     | v3
   p4     | v2 
   p4     | v3

So a simple query:

   select variant from DNA where person = 'p1' and variant 
     not in (select variant from DNA where person in ('p2', 'p3'))

will return all variants unique to p1 vs. p2 and p3 (p4 not considered for this query). However - it is slow and runs out of memory.

Should I be doing this a different way?

Hadley Is My Hero Hadley Is My Hero · Accepted Answer · 2018-10-24T19:36:40

I suspect that the reason it is running out of memory is that the select variant from DNA where person in ('p2', 'p3') sub-query will result in v2, v3, v2, v3. This, especially when brought to scale, seems exceedingly inefficient because of the repetition. Potentially, adding distinct to the query may help, but in general this seems like an inefficient method of achieving your results if you have a lot of people (you'd have to manually type a lot of people in where person in (.........).

An alternative to this is to do a self join and basically limit the results to those where the only match is itself. Something like:

SELECT person, COUNT(*)
FROM (
    SELECT * FROM table
    ALL LEFT JOIN table
    USING variant
)
GROUP BY person
HAVING COUNT(*) == 1;

NOT IN Subquery slow and runs out of memory (Clickhouse)

1 Answers