4
votes

I'm exporting data from Firebase to BigQuery but figures shown in Firebase are different to those I extract manually in BigQuery when I COUNT(DISTINCT) users with USER_ENGAGEMENT event.

In Firebase, I defined an "All users" Audience. In the relative Audience dashboard, Activity graph (that shows active users day by day vs previous month) I see that on 1st April I had 388.597 users.

I have a standard live extraction from Firebase to BigQuery that integrates event data and I would like to find the same number here.

This is the query I use in BigQuery

SELECT
  PARSE_DATE('%Y%m%d', event_date) AS Date
  , COUNT(DISTINCT user_pseudo_id ) AS Active_Users
FROM
  `xxxx.analytics_162988291.events_*`
WHERE FORMAT_DATE('%Y-%m', PARSE_DATE('%Y%m%d', event_date)) = FORMAT_DATE('%Y-%m', DATE_ADD(CURRENT_DATE(), INTERVAL -EXTRACT(DAY FROM CURRENT_DATE()) DAY))
  and event_name = 'user_engagement'
  GROUP BY Date
  ORDER BY Date;

For the 1st April, BigQuery returns 391394 distinct users, which is 2797 more than Firebase.

I found here Firebase vs BigQuery Active Users Discrepancies that this might be due to timezone settings. My project is set as GMT+1 while BigQuery uses UTC to name the tables. However in my query I'm already grouping by EVENT_DATE. To be sure, I anyway compared day-by-day the 2 figures for the month of April. here are the figures:

Firebase         BigQuery       Difference
 388 597         391 394        -2 797    
 378 393         374 023         4 370    
 372 890         374 601        -1 711    
 364 520         366 367        -1 847    
 375 903         374 479         1 424    
 453 630         452 077         1 553    
 476 413         476 401         12    
 398 481         396 227         2 254    
 378 338         375 531         2 807    
 368 699         369 699        -1 000    
 363 063         363 436        -373    
 369 874         368 441         1 433    
 460 472         454 647         5 825    
 481 832         487 214        -5 382    
 395 781         392 564         3 217    
 370 940         366 189         4 751    
 122 434         122 549        -115    
 342 249         344 086        -1 837    
 371 934         367 609         4 325    
 391 817         391 765         52    
 390 550         391 367        -817    
 412 276         408 876         3 400    
 377 219         379 659        -2 440    
 370 611         366 901         3 710    
 367 272         365 267         2 005    
 365 307         369 399        -4 092    
 438 676         439 990        -1 314    
 471 515         471 363         152    
 392 256         395 532        -3 276    
 369 500         368 416         1 084   
Total 
 11 581 442      11 566 069      15 373

If it was a timezone problem, I would have argued that the total difference should not have been too different than any other day, but the difference is instead more significant.

Also, as I'm using StandardSQL, COUNT(DISTINCT) should be accurate.

At this point, I suspect Firebase is defining "User Activity" as something different, but I fail to understand how.

Help!

EDIT

By reading Firebase -> BigQuery how to get active users for that month, week, day I started to notice how people are using a different grouping field than the one I use (user_pseudo_id), however it doesnt look like my table events_* has other app_info.* fields than

app_info.id STRING
app_info.version    STRING
app_info.install_store STRING
app_info.firebase_app_id    STRING
app_info.install_source STRING

could this be just a grouping problem?

1
Same issue here. user_pseudo_id seems to overcount the active users by a large amount. I wonder what's the right field to look at for computing DAU with Firebase.user1615898

1 Answers

0
votes

I believe the answer is hidden behind this sentence in Firebase documentation: "...active users for the date range, including fluctuation by percentage from the previous date range." (see the reference here https://support.google.com/firebase/answer/6317517?authuser=0#active-users). It means that Firebase first counts the active users exactly as you do but then adds some number based on the fluctuation of users in previous days.