
I'm attempting to stream data from a Kafka installation into BigQuery using Java, based on the Google samples. The rows are JSON, ~12 KB each. I batch them into blocks of 500 (roughly 6 MB) and stream them as:

InsertAllRequest.Builder builder = InsertAllRequest.newBuilder(tableId);

for (String record : bqStreamingPacket.getRecords()) {
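    // strip the stray comma after an opening brace ("{," -> "{") so Jackson can parse the row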
    Map<String, Object> mapObject = objectMapper.readValue(record.replaceAll("\\{,", "{"), new TypeReference<Map<String, Object>>() {});

    // remove nulls
    mapObject.values().removeIf(Objects::isNull);

    // create an id for each row - use to retry / avoid duplication
    builder.addRow(String.valueOf(System.nanoTime()), mapObject);
}

insertAllRequest = builder.build();

...


BigQueryOptions bigQueryOptions = BigQueryOptions.newBuilder()
    .setCredentials(Credentials.getAppCredentials())
    .build();

BigQuery bigQuery = bigQueryOptions.getService();

InsertAllResponse insertAllResponse = bigQuery.insertAll(insertAllRequest);
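
I time each call and check for row-level errors along these lines (a minimal sketch; hasErrors() and getInsertErrors() are the response accessors in the google-cloud-bigquery client):

long start = System.currentTimeMillis();
InsertAllResponse response = bigQuery.insertAll(insertAllRequest);
long elapsedMs = System.currentTimeMillis() - start;

// surface any per-row failures (none show up in my case)
if (response.hasErrors()) {
    response.getInsertErrors().forEach((row, errors) ->
        System.err.printf("row %d: %s%n", row, errors));
}
System.out.printf("inserted %d rows in %d ms%n",
    insertAllRequest.getRows().size(), elapsedMs);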

I'm seeing insert times of 3-5 seconds for each call. Needless to say, this makes BQ streaming less than useful. From their documentation I was worried about hitting the per-table insert quotas (I'm streaming from Kafka at ~1M rows/min), but now I'd be happy to be dealing with that problem instead.

All rows insert fine. No errors.

I must be doing something very wrong with this setup. Please advise.


1 Answer


We measure between 1,200 and 2,500 ms for each streaming request, and this has been consistent over the last three years, as you can see in the chart (we stream from SoftLayer to Google).

[chart: streaming insert latency, SoftLayer to Google, over three years]

Try varying the batch size from hundreds to thousands of rows, or until you reach some of the streaming API limits, and measure each call, as in the sketch below.
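
Something along these lines (a rough sketch; buildRequest(n) is a hypothetical stand-in for however you batch n rows into an InsertAllRequest, as in your question):

// measure insertAll latency across a range of batch sizes
for (int batchSize : new int[] {100, 250, 500, 1000, 2000, 5000}) {
    InsertAllRequest request = buildRequest(batchSize); // your batching, as in the question
    long start = System.currentTimeMillis();
    InsertAllResponse response = bigQuery.insertAll(request);
    System.out.printf("batch=%d rows, errors=%b, took %d ms%n",
        batchSize, response.hasErrors(), System.currentTimeMillis() - start);
}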

Based on this you can deduce more information, such as bandwidth problems between you and the BigQuery API, latency, or SSL handshake overhead, and eventually optimize it for your environment.

You can also leave your project ID/table here, and maybe some Google engineer will check it.