2
votes

I am currently attempting to run the following SQL script in BigQuery (with the goal of saving the table out) but am getting a SQL processing error relating to the capacity after I start the query without a row return size limit. The error is the common one: "Error: Resources exceeded during query execution."

SELECT date, 
       Concat(fullvisitorid, String(visitid)) AS unique_visit_id, 
       visitid, 
       visitnumber, 
       fullvisitorid, 
       totals.pageviews, 
       totals.bounces, 
       hits.page.pagepath, 
       hits.page.pagetitle, 
       device.devicecategory, 
       device.browser, 
       device.browserversion, 
       hits.customvariables.index, 
       hits.customvariables.customvarname, 
       hits.customvariables.customvarvalue, 
       hits.time 
FROM   (Flatten([XXXXXXXX.ga_sessions_20140711], hits.time)) 
WHERE  hits.customvariables.index = 4 
ORDER  BY unique_visit_id DESC, 
          hits.time ASC 

The job ID that was provided for the job is: ua-web-analytics:job_60fxNnmo9gZ23kaji50v3qH9dXs. I have read the other posts on the topic of these errors such as this post which focuses on the resource errors observed completing a join. I suspect that the issue right now is with the use of FLATTEN, and am working through some different approaches. That said, I am concerned because, in future, this query may be run on 30 or 60 days together (versus just the single day that I am prototyping on right now) which will dramatically increase the data size to over 500GB to 1TB. The goal of the above query was to generate a table which I could save out and then operate on. Unfortunately, doing this in an Ad Hoc manner seems somewhat problematic. Has anyone else encountered resource constraints when using a similar SQL query? For context, the table that is being queried over is about 17.2 GB in size, with just over a million rows.

2
You need to set: Allow Large Results - Pentium10
Got it. Thanks! Key point I noticed is that a table needs to be specified in advance in order to use the "Allow Large Results" option flag! - Nathaniel Payne

2 Answers

5
votes

As @Pentium10 mentioned, setting allow large results will allow you to return the larger results from the flattened query. Usually the signal that you should use "allow large results" is that you see a "result too large" error.

However, there is another part of your query that is unparallelizable: the ORDER BY operation. Is this required? Usually, we've found that most of the time when ORDER BY is used on large tables, what people really want is an ORDER BY ... LIMIT (which can be done efficiently and in parallel). Or they are just adding the ORDER BY because it makes it easier to eyeball the results. If you can drop the ORDER BY it will likely make your query faster and scale better as the data size increases.

0
votes

I've been through exactly the same problem and I really wanted the ORDER BY, so Jordan Tigani's answer wasn't an option.

  1. Run the query without ORDER BY and save in a dataset table.

  2. Export the content from that table to a bucket in GCS using wildcard (BUCKETNAME/FILENAME*.csv)

  3. Download the files to a folder in your machine.

  4. Install XAMPP (if you get a UAC warning) and change some settings after.

  5. Start Apache and MySQL in your XAMPP control panel.

  6. Install HeidiSQL and stablish the connection with your MySQL server (installed with XAMPP)

  7. Create a database and a table with its fields.

  8. Go to Tools > Import CSV file, configure accordingly and import.

  9. Once all data is imported, do the ORDER BY and export the table.