What are you trying to achieve?
If you're sending "unique" events to the HEC, or you're running UFs on "unique" logs, you'll never get duplicate "records when indexing".
It sounds like you (perhaps routinely?) resend the same data to your aggregation platform - which is not a problem with the aggregator, but with your sending process.
Almost like you're doing a MySQL/PostgreSQL "insert if not exists" operation. If that is a correct understanding of your situation, based on your statement
We currently use "document id" in ElasticSearch to deduplicate records when indexing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
We generate the id using hash of the content of the each log-record.
then you need to evaluate what is going "wrong" in your sending process that you feel you need to pre-clean the data before ingesting it.
It is true that Splunk won't "deduplicate records when indexing" - because it presumes the data coming-in to be 'correct' from whatever is submitting it.
How are you getting duplicate data in the first place?
Fields in Splunk which begin with the underscore (eg _time
, _cd
, etc) are not editable/sendable - they're generated by Splunk when it receives data. IOW, they're all internal fields. Searchable. Usable. But not overrideable.
If you really have a problem with [lots of/too much] duplicate data, and there is no way to fix your sending process[es], then you'll need to rely on deduplication operations in SPL when searching for/reporting on whatever you've ingested (primarily by using stats
and, when absolutely necessary/unavoidable, dedup
).