My question is: how much data are we allowed to process on BigQuery? I am using Stack Overflow's Kaggle dataset to analyze the data, and the text I am analyzing is around 27 GB. I just want to get the average length per entry, so I run:

query_length_text = """
    SELECT 
        AVG(CHAR_LENGTH(title)) AS avg_title_length,
        AVG(CHAR_LENGTH(body)) AS avg_body_length
    FROM
        `bigquery-public-data.stackoverflow.stackoverflow_posts`
"""

However, this fails with:

Query cancelled; estimated size of 26.847077486105263 exceeds limit of 1 GB

I am only returning two floats, so I know the result size isn't the problem. Does the 1 GB limit apply to the amount of data processed as well? How do I process it in batches, so I scan 1 GB at a time?
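
For reference, one way to check how much data a query would scan before running it is a BigQuery dry run; here is a minimal sketch using the google-cloud-bigquery client (assuming default credentials are available, as in a Kaggle kernel):

from google.cloud import bigquery

# A dry run asks BigQuery to estimate the bytes that would be scanned,
# without executing the query or consuming quota.
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query_job = client.query(query_length_text, job_config=job_config)
print(f"This query would process {query_job.total_bytes_processed / 1e9:.2f} GB.")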

Did you set a custom quota for yourself? Or did your project administrator set one? You need to remove or override the 1GB limit that you set. - Elliott Brossard
I now see that Kaggle has a 1 GB limit, which can be overridden by using the max_gb_scanned parameter. Thanks! - Rafay Kalim
I'm glad you figured it out! Consider adding an answer with the steps that you took in case someone else has the same problem in the future. - Elliott Brossard

1 Answer


Kaggle by default sets a 1 GB limit on requests (to keep your monthly quota of 5 TB from running out), which is what cancels the query here. To get past it, you can override the limit with the max_gb_scanned parameter, like this:

df = bq_assistant.query_to_pandas_safe(QUERY, max_gb_scanned=N)

where N is the number of gigabytes your query will scan, or any number higher than that.
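
For context, in Kaggle kernels bq_assistant is typically a BigQueryHelper from the bq_helper package. Here is a minimal end-to-end sketch; the use of estimate_query_size to pick N is an assumption about how you want to choose the ceiling, not something required by the API:

from bq_helper import BigQueryHelper

# Helper bound to the public Stack Overflow dataset used in the question.
bq_assistant = BigQueryHelper("bigquery-public-data", "stackoverflow")

QUERY = """
    SELECT
        AVG(CHAR_LENGTH(title)) AS avg_title_length,
        AVG(CHAR_LENGTH(body)) AS avg_body_length
    FROM
        `bigquery-public-data.stackoverflow.stackoverflow_posts`
"""

# estimate_query_size returns the estimated scan size in GB, so it can
# be used to set a ceiling just above what the query actually needs.
estimated_gb = bq_assistant.estimate_query_size(QUERY)
df = bq_assistant.query_to_pandas_safe(QUERY, max_gb_scanned=estimated_gb + 1)

Note that query_to_pandas_safe still refuses to run if the estimate exceeds max_gb_scanned, so overriding it raises the guardrail rather than removing it.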