2 votes

I would like to split a table with 120 million rows into multiple equally-sized tables in Google BigQuery. I found this answer to a question about getting row numbers for a table. The example solution was:

SELECT
  ROW_NUMBER() OVER() row_number,
  contributor_username
FROM (
  SELECT contributor_username
  FROM [publicdata:samples.wikipedia]
  GROUP BY contributor_username
)

which gives you a row_number column, after which you could create each table by manually selecting

WHERE row_number BETWEEN x AND y

This works for the table in the example, but using ROW_NUMBER() yields a "resources exceeded" error on a table with 117MM rows, presumably because an OVER() clause with no PARTITION BY has to process all rows in a single worker.

I then tried the following idea:

SELECT field1, field2
FROM (
  SELECT field1, field2, NTILE(100) OVER (ORDER BY row_id) AS ntile
  FROM (
    SELECT HASH(some_unique_key) AS row_id, field1, field2
    FROM table_with_177_MM_rows
  )
)
WHERE ntile = 1

I would then need to run this for ntile = [1, ..., 100]. This also failed, presumably for the same reason: the global ORDER BY in the window function cannot be parallelized. Is there any way to split a large table with a unique key in GBQ into smaller tables?


3 Answers

2 votes

I tried the approach below on a table with 500 million rows, as well as on a table with 3 billion rows, and it worked as expected.

First, create a new table (temp_table) with an extra field rnd, by running the following with temp_table set as the destination table:

SELECT
  field1, field2,
  RAND() AS rnd
FROM YourBigTable

Then, depending on how many tables you want to split into, run something like the query below as many times as needed:

SELECT 
  field1, field2 
FROM temp_table
WHERE rnd >= 0.3 AND rnd < 0.4

Of course, make sure you set the destination table accordingly for each run.
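For example, a minimal sketch of a 10-way split: RAND() is uniform on [0, 1), so the ten non-overlapping range filters below each capture roughly a tenth of the rows, and each query is run with its own destination table:

-- One query per destination table; together the ranges cover [0, 1).
SELECT field1, field2 FROM temp_table WHERE rnd >= 0.0 AND rnd < 0.1
SELECT field1, field2 FROM temp_table WHERE rnd >= 0.1 AND rnd < 0.2
-- ... and so on, in steps of 0.1, up to ...
SELECT field1, field2 FROM temp_table WHERE rnd >= 0.9 AND rnd < 1.0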

Delete temp_table once you are done.

Hard to say whether this is a better way or not, though. Up to you!

1 vote

Similar to my second example above, but without using the ntile window function:

SELECT field1, field2
FROM (
  SELECT HASH(some_unique_key) AS row_id, field1, field2
  FROM table_with_177_MM_rows
)
WHERE ABS(row_id) % {n_tables} = {table_id}

For n_tables = 10, you would run this for table_id = {0, 1, 2, ..., 9} to get each table. Maybe there's a better way?
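As a quick sanity check (a sketch, assuming n_tables = 10), you can verify that the shards come out roughly equal in size before materializing them:

-- Count rows per shard; a good hash should give roughly equal counts.
SELECT ABS(HASH(some_unique_key)) % 10 AS shard, COUNT(*) AS n_rows
FROM table_with_177_MM_rows
GROUP BY shard
ORDER BY shard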

-1 votes

Similar to the above, but all in one single query. (Note: these kinds of queries sometimes fail with an internal error and have to be re-run; that's a separate question.) The query below is easier to generate via code. In legacy BigQuery SQL, the comma between the parenthesized subselects acts as a UNION ALL; ABS() is added around HASH(), as in the previous answer, since HASH() can return negative values.

SELECT field1, field2
FROM
  (SELECT field1, field2
   FROM table_with_177_MM_rows
   WHERE ABS(HASH(some_unique_key)) % {n_tables} = {table_id_1}),
  (SELECT field1, field2
   FROM table_with_177_MM_rows
   WHERE ABS(HASH(some_unique_key)) % {n_tables} = {table_id_2}),
  ....
  (SELECT field1, field2
   FROM table_with_177_MM_rows
   WHERE ABS(HASH(some_unique_key)) % {n_tables} = {table_id_n})
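As a hedged sketch with the placeholders filled in, here is what the generated query looks like when pulling shards 0 and 1 of a 10-way split in one pass:

-- The comma between the subselects is a legacy-SQL UNION ALL.
SELECT field1, field2
FROM
  (SELECT field1, field2
   FROM table_with_177_MM_rows
   WHERE ABS(HASH(some_unique_key)) % 10 = 0),
  (SELECT field1, field2
   FROM table_with_177_MM_rows
   WHERE ABS(HASH(some_unique_key)) % 10 = 1)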