I think you can take one of these routes, depending on your application:
A. Ignore Header
If the problematic field is in the header row of your logs, you can ignore the header row by adding the --skip_leading_rows=1 parameter to your import command. Something like:
bq --location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'
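For example, if your logs happen to be CSV with two string columns (the format and schema here are assumptions, substitute your own):
bq --location=US load --source_format=CSV --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING'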
B. Load Raw Data
If the above is not applicable, then simply load the data into BigQuery in its unstructured, raw form. Once your data is in there, you can process it in all sorts of ways.
So, first create a table with a single column:
bq mk --table mydataset.rawlogstable 'data:STRING'
Now load your dataset into the table, providing the appropriate location:
bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'
Once your data is loaded, you can process it with SQL queries: split it on your delimiter and skip the rows you don't want.
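For example, a minimal filtering sketch (the header prefix colA, and the three-field check are assumptions, adjust them to your data):
select data
from `mydataset.rawlogstable`
where not starts_with(data, 'colA,') -- skip the header line (assumed prefix)
  and array_length(split(data, ',')) = 3 -- keep only well-formed rows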
C. Create External Table
If you do not want to load the data into BigQuery but still want to query it, you can create an external table instead:
bq --location=US mk --external_table_definition=data:STRING@CSV=gs://mybucket/path/* mydataset.rawlogstable
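You can then query the external table like any other table, for example a quick sanity check:
select data from `mydataset.rawlogstable` limit 10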
Querying Data
If you pick option A and it works for you, you can simply keep querying your data the way you already were.
If you pick B or C, your table now holds each row of your dataset in a single column. You can split those single-column rows into multiple columns, based on your delimiter.
Let's say your rows should have 3 columns named a,b and c:
a1,b1,c1
a2,b2,c2
Right now it's all in a single column named data, which you can split on the , delimiter:
select
splitted[safe_offset(0)] as a,
splitted[safe_offset(1)] as b,
splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)
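If you'd rather not re-split on every query, you can persist the parsed result to a new table (the name mydataset.cleanlogstable is just an example):
create or replace table `mydataset.cleanlogstable` as
select
  splitted[safe_offset(0)] as a,
  splitted[safe_offset(1)] as b,
  splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)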
Hope it helps.