2
votes

I am trying to migrate the data from Hive to BigQuery. Data in Hive table is stored in PARQUET file format.Data type of one column is STRING, I am uploading the file behind the Hive table on Google cloud storage and from that creating BigQuery internal table with GUI. The datatype of column in imported table is getting converted to BYTES.

But when I imported CHAR of VARCHAR datatype, resultant datatype was STRING only.

Could someone please help me to explain why this is happening.

1
Conversion of String data type is not given the above documentation.... could you please help to understand - Rush
It looks like the STRING logical type is being ignored currently, but you could file a feature request. - Elliott Brossard
Raised a feature request - issuetracker.google.com/119521327 - Rush

1 Answers

0
votes

That does not answer the original question, as I do not know exactly what happened, but had experience with similar odd behavior.

I was facing similar issue when trying to move the table between Cloudera and BigQuery.

First creating the table as external on Impala like:

CREATE EXTERNAL TABLE test1
  STORED AS PARQUET
  LOCATION 's3a://table_migration/test1'
  AS select * from original_table

original_table has columns with STRING datatype

Then transfer that to GS and importing that in BigQuery from console GUI, not many options, just select the Parquet format and point to GS.

And to my surprise I can see that the columns are now Type BYTES, the names of the columns was preserved fine, but the content was scrambled.

Trying different codecs, pre-creating the table and inserting still in Impala lead to no change.

Finally I tried to do the same in Hive, and that helped.

So I ended up creating external table in Hive like:

CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
  STORED AS PARQUET
  LOCATION 's3a://table_migration/test2';

insert into table test2 select * from original_table;

And repeated the same dance with copying from S3 to GS and importing in BQ - this time without any issue. Columns are now recognized in BQ as STRING and data is as it should be.