1
votes

I am trying to validate the events we stream into BigQuery by cross-checking them against Mixpanel. For every event type we stream, however, BigQuery always shows more events than Mixpanel does. I first thought this was a duplication issue, but the timestamps differ for each event within BigQuery. The only other cause I can think of is a significant lag in the streaming insert, so that certain events do not show up in the table for up to an hour. If anyone can give me insight into this issue, I would appreciate it. To clarify:

  1. I am validating the BigQuery data by counting how many events stream in per day.

  2. The difference is fairly small: for example, on one particular day Mixpanel shows 634 events while BigQuery shows 703.

  3. I have already accounted for the timezone difference, since Mixpanel reports events in your current time zone and my company stores events in UTC.
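
For reference, the per-day comparison described in the points above can be sketched roughly as follows. This is only an illustration: the sample timestamps are made up, and `PROJECT_TZ` is a hypothetical fixed UTC-8 offset standing in for whatever time zone the Mixpanel project uses (a real zone with daylight saving would need `zoneinfo` instead).

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offset standing in for the Mixpanel project time zone.
PROJECT_TZ = timezone(timedelta(hours=-8))


def daily_counts(utc_timestamps, tz):
    """Count events per calendar day after converting UTC timestamps to tz."""
    counts = Counter()
    for ts in utc_timestamps:
        local_day = ts.astimezone(tz).date().isoformat()
        counts[local_day] += 1
    return dict(counts)


# Made-up sample events, stored in UTC as in the BigQuery table.
events_utc = [
    datetime(2016, 3, 1, 2, 30, tzinfo=timezone.utc),
    datetime(2016, 3, 1, 9, 0, tzinfo=timezone.utc),
    datetime(2016, 3, 1, 23, 59, tzinfo=timezone.utc),
]

by_utc_day = daily_counts(events_utc, timezone.utc)
by_project_day = daily_counts(events_utc, PROJECT_TZ)
```

The same three events land in one day when bucketed in UTC but are split across two days when bucketed in the project time zone, which is why both sides have to be counted in the same zone before the totals are compared.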


1 Answer

1
votes

If you retry failed streaming inserts, it is possible that some of the requests reported as failed actually succeeded, so each retry creates duplicate rows.

You can mitigate this by supplying a unique insertId with each row in the streaming insert; BigQuery will then perform best-effort de-duplication.
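
As a minimal sketch of that idea, the snippet below builds the body of a `tabledata.insertAll` request with one `insertId` per row. The event fields and the helper names `make_insert_id` and `build_insert_all_body` are hypothetical; the key assumption is that the ID is derived from the event itself, so a retry of the same event sends the same ID rather than a fresh one. No request is actually sent here.

```python
import hashlib
import json


def make_insert_id(event):
    """Derive a stable ID from the event content so a retry reuses the same ID."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_insert_all_body(events):
    """Build a tabledata.insertAll request body with one insertId per row."""
    return {
        "rows": [{"insertId": make_insert_id(e), "json": e} for e in events]
    }


# Hypothetical event; the field names are for illustration only.
event = {"event_id": "abc-123", "name": "signup", "ts": "2016-03-01T09:00:00Z"}

first_attempt = build_insert_all_body([event])
retry_attempt = build_insert_all_body([event])  # same event, same insertId
```

If your events already carry a unique identifier from the client, using that directly as the insertId is simpler than hashing; hashing the whole payload would also collapse two genuinely distinct events that happen to have identical fields. If you use the google-cloud-bigquery Python client instead of the REST API, `Client.insert_rows_json` accepts the same IDs through its `row_ids` argument.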

When you say the time is different for each event, are you referring to a column present in your dataset or to the creation_time column?