
I have data flowing from Event Hub to Azure Data Explorer. We configured a main table which holds all the data. Upon ingestion into that main table, we create a few sub tables that hold partial data from the main table. For example:

From Event Hub we receive the following JSON data:

{
  "TIME": "T",
  "data": [
    {
      "ABC": 1234,
      "DEF": 1235,
      "GHI": 1236
    },
    {
      "ABC": 1223,
      "DEF": 1236,
      "GHI": 12
    },
    {
      "ABC": 2213,
      "DEF": 1235,
      "GHI": 1236
    }
  ]
}

Upon ingesting this data into the main Raw table, we have a few ingestion-time policies which expand this JSON and ingest it into sub tables in the following format:

TIMEINUTC ABC
T 1234
T 1223
T 2213

Similarly other tables as follows.

TIMEINUTC DEF
T 1235
T 1236
T 1235

TIMEINUTC GHI
T 1236
T 12
T 1236

Now I can run my analytical jobs / Kusto queries against these smaller tables.
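The expansion described above can be sketched as an update policy; the function name `ExpandToABC` and the column types are assumptions for illustration:

```kusto
// Hypothetical function that expands the raw JSON into the ABC sub-table shape.
// Assumes Raw has columns TIME:datetime and data:dynamic.
.create-or-alter function ExpandToABC() {
    Raw
    | mv-expand record = data
    | project TIMEINUTC = TIME, ABC = tolong(record.ABC)
}

// Attach it as an update policy so each ingestion into Raw
// also populates the ABC sub table.
.alter table ABC policy update
@'[{"IsEnabled": true, "Source": "Raw", "Query": "ExpandToABC()", "IsTransactional": false}]'
```

The DEF and GHI sub tables would each get an analogous function and policy.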

The problem is, I want to make the insertion into one of these sub tables conditional. For example, say the value 1234 arrives for ABC and is inserted into the ABC sub table. If I receive the value 1234 for ABC again some time later, I don't want to insert it into the sub table again.

  1. Is it a good approach to write such logic in the ingestion-time policy?
  2. If such a policy takes some time to execute, will it affect the ingestion of raw data from Event Hub into the main Raw table?
  3. Or should I go with a scheduled approach that runs at periodic intervals from ADF/Logic Apps?

1 Answer


I assume that by "ingestion time policy" you are referring to an update policy. For the scenario you are describing (avoiding duplicate records), a better approach is a materialized view (see the "deduplication" example in the documentation).
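For example, a deduplicating materialized view over the ABC sub table could be defined as follows; the view name is an assumption for illustration:

```kusto
// Keeps a single record per distinct ABC value, so repeated
// arrivals of the same value do not produce duplicates in query results.
.create materialized-view DedupedABC on table ABC
{
    ABC
    | summarize take_any(TIMEINUTC) by ABC
}
```

You would then query `DedupedABC` instead of `ABC` wherever duplicates matter; the underlying sub table can keep receiving all rows.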

Regarding Q2: this depends on whether the update policy is transactional. If it is, it will impact ingestion into the raw table, since the ingestion will not commit until all update policies have completed successfully. If it is not transactional, it will have no significant impact.
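The behavior is controlled by the `IsTransactional` flag on the policy itself; a minimal sketch, where the table and function names are hypothetical placeholders:

```kusto
// With IsTransactional=true, a failure in the update policy query
// fails (and blocks the commit of) the ingestion into Raw as well.
// ExpandToABC() stands in for whatever expansion query you use.
.alter table ABC policy update
@'[{"IsEnabled": true, "Source": "Raw", "Query": "ExpandToABC()", "IsTransactional": true}]'
```

Setting `"IsTransactional": false` decouples the two, at the cost of the sub table possibly missing records if the policy query fails.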