0
votes

I have a table with around 2 Milliones Rows The point is that when I make a left join with other table that has 60 rows the query get extremely slow. I am doing a report in Data Studio and this was not the first time that after a Join in Bigquery the report is no more useful. Every time change a parameter is taking more than 40 seconds or a 1 minute when the table is joined. If the table is not joined the query take between 6 or 8 seconds. Like a normal query. I don't know if the problem is in Data Studio or in BigQuery. Could someone help me? Because now is imposible to make a dashboard in Data Studio with a Join using Bigquery

Here both queries: Without Join

SELECT
Tag_Id,
Image_Id,
Stream,
Tagging_Worker_Id,
Tagging_Worker_Name,
Tagging_Task_Id,
CASE WHEN Tagging_Time_Per_Tag > 200 THEN 200 ELSE Tagging_Time_Per_Tag END AS Tagging_Time_Per_Tag,
Tagging_Date,
Tagging_Class_Name,
Tagging_Class_Id,
Tagging_Template_Id,
Tagging_Top,
Tagging_Left,
Tagging_Width,
Tagging_Height,
Is_Tag_Adjusted,
Is_Class_Adjusted,
CASE WHEN (Is_Tag_Adjusted+Is_Class_Adjusted > 0) THEN 1 ELSE 0 END AS TagsAdjusted
FROM Stats.TaggingStats
where Tagging_Date>=  '2018-10-01'

With Join

SELECT
  st.Tag_Id,
  st.Image_Id,
  st.Stream,
  st.Tagging_Worker_Id,
  st.Tagging_Worker_Name,
  st.Tagging_Task_Id,
  st.Tagging_Time_Per_Tag,
  st.Tagging_Date,
  st.Tagging_Class_Name,
  st.Tagging_Class_Id,
  st.Tagging_Template_Id,
  st.Tagging_Top,
  st.Tagging_Left,
  st.Tagging_Width,
  st.Tagging_Height,
  st.Is_Tag_Adjusted,
  st.Is_Class_Adjusted,
  st.TagsAdjusted,
  CASE
    WHEN (sal.Type_Salary=2 AND (st.Is_Tag_Adjusted=1 OR st.Is_Tag_Adjusted=1)) THEN 0
    WHEN sal.Type_Salary=1 THEN st.Tagging_Time_Per_Tag*sal.Salary_Per_Second
    WHEN sal.Type_Salary=2 AND st.Is_Tag_Adjusted=0 AND st.Is_Tag_Adjusted=0 THEN 3
    ELSE st.Tagging_Time_Per_Tag
  END AS CostPerTag,
  CASE
    WHEN sal.Type_Salary IS NULL THEN 'Workers Without Costing'
    WHEN Type_Salary=1 THEN 'Workers With Salary Per Hour'
    WHEN Type_Salary=2 THEN 'Workers With Fixed Price Per Tag'
    ELSE 'Error'
  END AS Costing_Method
FROM (
  SELECT
    Tag_Id,
    Image_Id,
    Stream,
    Tagging_Worker_Id,
    Tagging_Worker_Name,
    Tagging_Task_Id,
    CASE
      WHEN Tagging_Time_Per_Tag > 200 THEN 200
      ELSE Tagging_Time_Per_Tag
    END AS Tagging_Time_Per_Tag,
    Tagging_Date,
    Tagging_Class_Name,
    Tagging_Class_Id,
    Tagging_Template_Id,
    Tagging_Top,
    Tagging_Left,
    Tagging_Width,
    Tagging_Height,
    Is_Tag_Adjusted,
    Is_Class_Adjusted,
    CASE
      WHEN (Is_Tag_Adjusted+Is_Class_Adjusted > 0) THEN 1
      ELSE 0
    END AS TagsAdjusted
  FROM
    Stats.TaggingStats) st
LEFT JOIN
  Stats.Salary sal
ON
  sal.Tagging_Worker_Id=st.Tagging_Worker_Id
WHERE
  Tagging_Date>= '2018-10-01'

But now after looking some numbers I am more confused than before. Running manually the queries, I can see that all the different variations take less than 20 seconds. But anyway, there are a lot of strangers things and the report in Data Studio is no useful. Is taking more than one minute every time I change a parameter. I tried to put all here with more details.

First Conclusion Doing some test in the times I can say that the queries with the CASE clause are not slower than the ones that don’t have the CASE clauses. In fact some times, are taking less time.

Second Conclusion When I query all the data of table A (2 Milliones Row) the times are almost the same that when I query all the data of the Table A and make a left join with the Table B(60 Rows). So, for the entire table the times with or without join are almost the same and is a lot only for 400 MB. The difference is only visible when I use the Where clause for one month.

Third Conclusion The google data studio report need 90 seconds to retrieve the data after I change a parameter. Google Data Studio is generating automatically different queries to complete the data in the parameters, scorecards and the chart. Every time I change a parameter, Google Data Studio is sending 6 different queries to retrieve this data. I analyzed the six queries and I put in this google document with the time of each one running manually in the Console. https://docs.google.com/document/d/1z_y5CqJW-LrLY5YyLXjSrc455RLUbklPxhnKmAS5zFk/edit?usp=sharing Each one of these six queries is taking (2.5 seconds approx). I copy them and I run manually each one. So, in the case that they need to process one after one the total of seconds has to be around 12 Seconds. So, this appear to be a problem of Google Data Studio retrieve the data from BigQuery. Is taking more than one minute. Impossible to offer this to a client. Before I made the Join I was working with the dashboard with a reasonable time response. But the JOIN appear to be killing Google Data Studio. I don’t know.

I am leaving here the Numbers that support the conclusion 1 and 2 https://docs.google.com/document/d/1sc3qjVpQrETofIgToJPIhZVs9HTjHsjTcrZvbt2NYcI/edit?usp=sharing With all the queries and variations to analyze the impact of the join, case clause and where clause.

The Query Before the JOIN The Query Before the JOIN and a Where Clause for This Month The Query with the JOIN and a Where Clause for This Month The Query with the JOIN for all the TABLE The Query with the JOIN and a Where Clause for This Month and the CASE clauses The Query with the JOIN for all the TABLE and the CASE clauses

Other issues: Discrepancy: Running manually using the Big Query console I found a lot of Discrepancy in the time to run for the same Query. I run the same query three times.

RUN 1) Query complete (12.955 sec elapsed, 490.83 MB processed)

RUN 2) Query complete (20.782 sec elapsed, 490.83 MB processed)

RUN 3) Query complete (10.814 sec elapsed, 490.83 MB processed)

1

1 Answers

3
votes

You are joining two table that create a broadcast join which sends all the data in to one slot. Plus you are doing a lot of computation (CASE). All this together is the reason why the query take way longer. I recommend to reduce the data before the join and/or to materialize the data.

To have a better understanding on how BigQuery works you can review this link.