I think you can take one of these routes, depending on your application:
A. Ignore Header
If the problematic field is in the header row of your logs, you can ignore the header row by adding the --skip_leading_rows=1 parameter to your import command. Something like:
bq --location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'
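For example, if your logs happen to be CSV with two string columns (the format and schema here are assumptions, substitute your own):
bq --location=US load --source_format=CSV --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING'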
B. Load Raw Data
If the above is not applicable, then simply load the data into BigQuery in its unstructured, raw form. Once your data is in there, you can process it in all sorts of ways.
So, first create a table with a single column:
bq mk --table mydataset.rawlogstable 'data:STRING'
Now load your dataset into the table, providing the appropriate location:
bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'
Once your data is loaded, you can process it with SQL queries: split it on your delimiter and skip the rows you don't want.
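For example, a minimal filtering sketch (the header prefix colA, and the three-field check are assumptions, adjust them to your data):
select data
from `mydataset.rawlogstable`
where not starts_with(data, 'colA,') -- skip the header line (assumed prefix)
  and array_length(split(data, ',')) = 3 -- keep only well-formed rows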
C. Create External Table
If you do not want to load the data into BigQuery but still want to query it, you can create an external table instead:
bq --location=US mk --external_table_definition=data:STRING@CSV=gs://mybucket/path/* mydataset.rawlogstable
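You can then query the external table like any other table, for example a quick sanity check:
select data from `mydataset.rawlogstable` limit 10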
Querying Data
If you pick option A and it works for you, you can simply keep querying your data the way you already were.
If you pick B or C, your table now holds each row of your dataset in a single column. You can split those single-column rows into multiple columns, based on your delimiter.
Let's say your rows should have 3 columns named a,b and c:
a1,b1,c1
a2,b2,c2
Right now it's all in a single column named data, which you can split on the , delimiter:
select
splitted[safe_offset(0)] as a,
splitted[safe_offset(1)] as b,
splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)
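If you'd rather not re-split on every query, you can persist the parsed result to a new table (the name mydataset.cleanlogstable is just an example):
create or replace table `mydataset.cleanlogstable` as
select
  splitted[safe_offset(0)] as a,
  splitted[safe_offset(1)] as b,
  splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)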
Hope it helps.