I'm using Dataflow to write data into BigQuery via BigQueryIO.Write.to().
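For reference, the sink is applied roughly like the sketch below. This is a simplified illustration, not my exact code: the table spec, schema, and PCollection names are placeholders.

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

// rows is the PCollection of TableRows produced earlier in the pipeline
PCollection<TableRow> rows = ...;
TableSchema schema = ...;  // schema matching the destination table

rows.apply(BigQueryIO.Write
    .to("my-project:my_dataset.my_table")   // placeholder table spec
    .withSchema(schema)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));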
Sometimes, I get this warning from Dataflow:
{
  metadata: {
    severity: "WARNING"
    projectId: "[...]"
    serviceName: "dataflow.googleapis.com"
    region: "us-east1-d"
    labels: {
      compute.googleapis.com/resource_type: "instance"
      compute.googleapis.com/resource_name: "dataflow-[...]-08240401-e41e-harness-7dkd"
      dataflow.googleapis.com/region: "us-east1-d"
      dataflow.googleapis.com/job_name: "[...]"
      compute.googleapis.com/resource_id: "[...]"
      dataflow.googleapis.com/step_id: ""
      dataflow.googleapis.com/job_id: "[...]"
    }
    timestamp: "2016-08-30T11:32:00.591Z"
    projectNumber: "[...]"
  }
  insertId: "[...]"
  log: "dataflow.googleapis.com/worker"
  structPayload: {
    message: "exception thrown while executing request"
    work: "[...]"
    thread: "117"
    worker: "dataflow-[...]-08240401-e41e-harness-7dkd"
    exception: "java.net.SocketTimeoutException: Read timed out
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      at java.net.SocketInputStream.read(SocketInputStream.java:170)
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
      at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
      at sun.security.ssl.InputRecord.read(InputRecord.java:503)
      at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:961)
      at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:918)
      at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
      at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
      at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
      at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
      at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
      at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
      at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1535)
      at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
      at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
      at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
      at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
      at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
      at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
      at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
      at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
      at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
      at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter$1.call(BigQueryTableInserter.java:229)
      at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter$1.call(BigQueryTableInserter.java:222)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)"
    logger: "com.google.api.client.http.HttpTransport"
    stage: "F5"
    job: "[...]"
  }
}
I don't see any "retry" log following this one.
My questions are:
- Am I losing data? I don't know whether the write operation completed successfully; if I understand the code correctly, the entire write batch is left in an uncertain state.
- If so, is there a way for me to guarantee that data is written to BigQuery exactly once?
- If so, shouldn't the severity be ERROR rather than WARNING?
Here's a bit of context on my usage:
- I'm using Dataflow in streaming mode, reading from Kafka using KafkaIO.java (the overall pipeline shape is sketched after this list)
- "Sometimes" can be from 0 to 3 times per hour
- Depending on the job, I'm using 2 to 36 workers of type n1-standard-4
- Depending on the job, I'm writing from 3k to 10k messages/s to BigQuery
- Average message size is 3kB
- Dataflow workers are in the us-east1-d zone; the BigQuery dataset location is US
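For completeness, the pipeline looks roughly like the sketch below. This is a simplified, approximate illustration rather than my real code: the broker address, topic, worker count, and the FormatAsTableRowFn DoFn are placeholders, and the KafkaIO read options are reduced to the basics.

import java.util.Arrays;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

DataflowPipelineOptions options =
    PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setStreaming(true);                      // streaming mode
options.setWorkerMachineType("n1-standard-4");
options.setNumWorkers(4);                        // 2 to 36 depending on the job

TableSchema schema = ...;                        // destination table schema

Pipeline p = Pipeline.create(options);
p.apply(KafkaIO.read()                           // KafkaIO.java
        .withBootstrapServers("broker:9092")     // placeholder
        .withTopics(Arrays.asList("my-topic"))   // placeholder
        .withValueCoder(StringUtf8Coder.of())
        .withoutMetadata())
 .apply(ParDo.of(new FormatAsTableRowFn()))      // hypothetical DoFn: message -> TableRow
 .apply(BigQueryIO.Write
        .to("my-project:my_dataset.my_table")    // placeholder table spec
        .withSchema(schema));
p.run();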