0
votes

I already read this, but I still have questions. I only have one VM with 16 GB of RAM, 4 cores, and a 100 GB disk, running only ClickHouse and a light web API.

I'm storing leaked credentials in a database:

CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64)
) ENGINE = ReplacingMergeTree
PARTITION BY first_seen
ORDER BY (user, domain, password, first_seen)

It sometimes happens that a credential appears more than once (within one file or across several).

My long-term objective is (was) the following: when inserting a credential that is already in the database, I want to keep the smaller first_seen and append the new leak id to the leaks field.

I tried the ReplacingMergeTree engine, inserted the same data twice ($ cat data.csv | clickhouse-client --query 'INSERT INTO credential FORMAT CSV'), and then ran OPTIMIZE TABLE credential to force the replacing engine to do its asynchronous job, as the documentation suggests. Nothing happens; the data is in the database twice.

So I wonder:
- what did I miss with the ReplacingMergeTree engine?
- how does OPTIMIZE work, and why doesn't it do what I was expecting?
- is there a real solution for avoiding duplicated data on a single ClickHouse instance?

I have already tried to do it manually. My problem is that I have 4.5 billion records in my database, and identifying duplicates inside a 100k-entry sample already takes almost 5 minutes with the following query:

SELECT user, domain, password, count() AS c
FROM credential
WHERE has(leaks, 0)
GROUP BY user, domain, password
HAVING c > 1

This query obviously does not work on the 4.5B entries, as I do not have enough RAM.

Any ideas will be tried.


1 Answer

2
votes

Multiple things are going wrong here:

  • You partition very granularly: one partition per day. You should partition by something coarser, such as a month of data; otherwise ClickHouse has to manage and scan lots of small parts and files.
  • You don't provide the table engine with a version. The problem here is that ClickHouse is not able to find out which row should replace the other. I suggest you use the "version" parameter of ReplacingMergeTree: it lets you supply an incrementing number or, if that works better for you, the current DateTime (where the latest DateTime always wins).

  • You should never design your solution to require OPTIMIZE to be called to make your result sets consistent; it is not designed for that.
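Putting the first two points together, a revised schema could look like the sketch below (the version column, the monthly partitioning, and dropping first_seen from the sorting key are my suggestions, not tested against your data; note that rows only replace each other when their full sorting key matches, so keeping first_seen in the key would prevent re-sightings from ever collapsing):

```sql
CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64),
    version DateTime DEFAULT now()   -- the row with the highest version wins on merge
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(first_seen)    -- one partition per month instead of per day
ORDER BY (user, domain, password);   -- first_seen removed so re-sightings collapse
```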

ClickHouse always lets you write a query that gives you (eventual) consistency without running OPTIMIZE beforehand.

Another reason to avoid OPTIMIZE, besides it being really slow and heavy on your DB: you can end up with race conditions, where other clients of the database (or replicating ClickHouse nodes) invalidate your data between the moment OPTIMIZE finishes and the moment your SELECT runs.

Bottom line, as a solution: add a version column. When inserting rows, insert the current timestamp as the version. Then, for each key, select only the row with the highest version, so that you do not depend on OPTIMIZE for anything other than garbage collection.
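A sketch of such a query, assuming the version column described above (argMax(x, version) returns the x value from the row with the highest version; min(first_seen) keeps the earliest sighting, which matches your original goal):

```sql
SELECT
    user,
    domain,
    password,
    min(first_seen)        AS first_seen,  -- earliest sighting survives
    argMax(leaks, version) AS leaks        -- array from the newest row
FROM credential
GROUP BY user, domain, password;
```

On recent ClickHouse versions you can also write SELECT ... FROM credential FINAL to get the merged view directly, at a noticeable performance cost.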