0 votes

I have a huge amount of log data exported from Stackdriver to Google Cloud Storage. I am trying to run queries on it using BigQuery.

However, while creating the table in BigQuery Dataset I am getting

Invalid field name "k8s-app". 
Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long. 
Table: bq_table

A huge amount of log data is exported from Stackdriver sinks, and it contains a large number of unique column names. Some of these names are not valid column names for BigQuery tables.

What is the solution for this? Is there a way to query the log data without cleaning it? Using temporary tables or something else?

Note: I do not want to load (put) my data into BigQuery storage, just query the data that is already present in Google Cloud Storage.

* EDIT *

Please refer to this documentation for a clearer understanding

I assume the search tools in Stackdriver Logging | Log Viewer do not do what you need, which is why you exported the data and are looking for a different way to query it? Just wanted to make sure before I tried to provide an answer. Thanks. - AlphaPapa
Yes, they do not allow modifying the log type or any log data. Logs can only be deleted. - Amit Yadav

4 Answers

3 votes

I think you can go down one of these routes, based on your application:

A. Ignore Header

If the problematic field is in the header row of your logs, you can choose to ignore the header row by adding the --skip_leading_rows=1 parameter in your import command. Something like:

  bq --location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'

B. Load Raw Data

If the above is not applicable, then simply load the data into BigQuery in its unstructured, raw form. Once your data is in there, you can process it however you like.

So, first create a table with a single column:

bq mk --table mydataset.rawlogstable 'data:STRING'

Now load your dataset into the table, providing the appropriate location:

bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'

Once your data is loaded, you can process it using SQL queries, splitting it on your delimiter and skipping the stuff you don't need.

C. Create External Table

If you do not want to load data into BigQuery but still want to query it, you can choose to create an external table in BigQuery:

bq --location=US mk --external_table_definition=data:STRING@CSV=gs://mybucket/path/* mydataset.rawlogstable

Querying Data

If you pick option A and it works for you, you can simply choose to query your data the way you were already doing.

If you pick B or C, your table now holds each row of your dataset in a single column. You can split that single column into multiple columns, based on your delimiter requirements.

Let's say your rows should have 3 columns named a,b and c:

 a1,b1,c1
 a2,b2,c2

Right now it's all in a single column named data, which you can split on the delimiter ,:

 select 
    splitted[safe_offset(0)] as a, 
    splitted[safe_offset(1)] as b,
    splitted[safe_offset(2)] as c
 from (select split(data, ',') as splitted from `mydataset.rawlogstable`)

Hope it helps.

1 vote

To expand on @khan's answer:

  • If the files are JSON, then you won't be able to use the first method (skip headers).

  • But you can load each JSON row raw into BigQuery - as if it were a CSV - and then parse it in BigQuery (see the sketch below).
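
A minimal sketch of that raw load, reusing the bucket path and single-column table from the answer above (both assumptions); the tab field delimiter and empty quote character are deliberate choices here so that the commas and quotes inside the JSON do not split or terminate the row:

  bq --location=US load --source_format=CSV --field_delimiter='\t' --quote='' mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'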

Find a full example for loading rows raw at:

You can then use JSON_EXTRACT_SCALAR to parse the JSON in BigQuery - and transform the existing field names into BigQuery-compatible ones.
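
For instance, a hedged sketch of that parsing step, reusing the mydataset.rawlogstable table and data column from the answer above (assumptions here), and pulling the problematic k8s-app field from the question into a BigQuery-compatible column name; the JSON paths are only guesses at where those fields sit in your log entries, so adjust them to your data:

 select
    json_extract_scalar(data, '$.timestamp') as log_timestamp,
    json_extract_scalar(data, "$.labels['k8s-app']") as k8s_app
 from `mydataset.rawlogstable`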

0 votes

Unfortunately no!

As part of log analytics, it is common to reshape the log data and run a few ETL steps before the files are committed to a persistent sink such as BigQuery.

If performance monitoring is all you need from your log analytics, and there is no rationale for writing additional ETL code, all metrics can be derived from the REST API endpoints of Stackdriver Monitoring.
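
For instance, a rough sketch of pulling one such metric straight from the Monitoring API's timeSeries endpoint; the project ID, metric type, and time window below are placeholders, not values from the question:

  curl -G "https://monitoring.googleapis.com/v3/projects/my-project/timeSeries" -H "Authorization: Bearer $(gcloud auth print-access-token)" --data-urlencode 'filter=metric.type="logging.googleapis.com/log_entry_count"' --data-urlencode 'interval.startTime=2019-01-01T00:00:00Z' --data-urlencode 'interval.endTime=2019-01-02T00:00:00Z'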

0 votes

If you do not need the fields containing -, you can set ignore_unknown_values. You provide the schema you want, and with ignore_unknown_values any field not matching that schema will be ignored.
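
A minimal sketch of that, assuming newline-delimited JSON exports and a hand-picked schema; the table name and bucket path mirror the examples above, and the field names are just typical Stackdriver log fields you might keep, so adjust them to your own data:

  bq --location=US load --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mydataset.logstable gs://mybucket/path/* 'timestamp:TIMESTAMP,severity:STRING,textPayload:STRING'

Any field not listed in the schema (including ones with - in the name, like k8s-app) is simply dropped at load time.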