
I can't find how to get the tables of a given dataset. I want to use Dataflow to migrate tables from a dataset in the US location to a dataset in the EU location. I would like to read all the tables of the US dataset in parallel and write them to the EU dataset, roughly as in the sketch below.
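
Conceptually, this is what I am aiming for (just a sketch, not working code; eu_dataset is a made-up destination dataset name, and createAuthorizedClient() is the helper shown further down):

    // Sketch: list every table of the source dataset with the BigQuery v2 client,
    // then add one read/write branch per table to a single pipeline.
    Bigquery bigquery = createAuthorizedClient();
    TableList tables = bigquery.tables().list("customerjourney", "dsorigen").execute();

    Pipeline p = Pipeline.create(options);
    for (TableList.Tables t : tables.getTables()) {
        String tableId = t.getId();                                     // "project:dataset.table"
        String tableName = tableId.substring(tableId.indexOf('.') + 1); // bare table name
        p.apply("Read_" + tableName, BigQueryIO.readTableRows().from(tableId))
         .apply("Write_" + tableName, BigQueryIO.writeTableRows()
                 .to("customerjourney:eu_dataset." + tableName)         // eu_dataset is a placeholder
                 // CREATE_IF_NEEDED would also require .withSchema(...) for each table
                 .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                 .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
    }
    p.run().waitUntilFinish();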

Beam 2.4 is using com.google.api.services.bigquery v2-rev374-1.22.0. This is also the library that you should use with Beam 2.4.

The code runs successfully with DirectRunner, but when I run it with DataflowRunner it fails and throws this error:

jun 29, 2018 1:52:48 PM com.google.api.client.http.HttpRequest execute
WARNING: exception thrown while executing request
java.net.SocketException: Unexpected end of file from server
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:851)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
        at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
        at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
        at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
        at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
        at com.google.api.client.googleapis.media.MediaHttpUploader.executeUploadInitiation(MediaHttpUploader.java:519)
        at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:384)
        at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
        at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:357)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

(the same warning is logged again at 1:52:55 PM with an identical stack trace)

The code:

package emp.customerjourney.etls;

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.Bigquery.*;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.*;


import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
// CountTornadoes comes from the Beam examples module (beam-examples-java), assumed to be on the classpath
import org.apache.beam.examples.cookbook.BigQueryTornadoes;


import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.Validation;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class Migracion {

    public static interface MyOptions extends DataflowPipelineOptions {
        @Description("BigQuery table to write to, specified as "+ "<project_id>:<dataset_id>.<table_id>. The dataset must already exist.")
        @Default.String("customerjourney:prueba.weather_stations")
        @Validation.Required
        String getOutput();
        void setOutput(String s);

        @Description("Table to read from, specified as "+ "<project_id>:<dataset_id>.<table_id>")
        @Default.String("customerjourney:118678548.gsod3")
        String getInput();
        void setInput(String value);


    }


    public static Bigquery createAuthorizedClient() throws IOException {
        // Create the credential
        HttpTransport transport = new NetHttpTransport();
        JsonFactory jsonFactory = new JacksonFactory();
        GoogleCredential credential = GoogleCredential.getApplicationDefault(transport, jsonFactory);

        if (credential.createScopedRequired()) {
            credential = credential.createScoped(BigqueryScopes.all());
        }

        return new Bigquery.Builder(transport, jsonFactory, credential)
                .setApplicationName("Bigquery Samples")
                .build();
    }


    public static void main(String[] args) throws Exception {


        String projectId = "customerjourney";
        String datasetName = "dsorigen";
        // Create a new Bigquery client authorized via Application Default Credentials.
        Bigquery bigquery = createAuthorizedClient();
        // List the tables of the source dataset with the BigQuery v2 API
        Bigquery.Tables.List lista = bigquery.tables().list(projectId, datasetName);
        TableList rp = lista.execute();
        List<TableList.Tables> tblista = rp.getTables();
        // Take the id of one table (the fourth) as the input; the id has the form project:dataset.table
        String entrada = tblista.get(3).getId();

        MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);

        options.setTempLocation("gs://pruebasg/teststaging");
        options.setRegion("europe-west1");
        options.setStagingLocation("gs://pruebasg/temp_dataflow_tasks");
        Pipeline p = Pipeline.create(options);

        // Build the table schema for the output table.
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("month").setType("INTEGER"));
        fields.add(new TableFieldSchema().setName("tornado_count").setType("INTEGER"));
        TableSchema schema = new TableSchema().setFields(fields);

        // p.apply(BigQueryIO.readTableRows().from(options.getInput()))
        p.apply(BigQueryIO.readTableRows().from(entrada)) // table id obtained above from the BigQuery v2 API
                .apply(new BigQueryTornadoes.CountTornadoes())
                .apply(BigQueryIO.writeTableRows()
                        .to(options.getOutput())
                        .withSchema(schema)
                        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));



        p.run().waitUntilFinish();

        options.getExecutorService().shutdown();
        try {
            options.getExecutorService().awaitTermination(3, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            System.out.println("Thread was interrupted waiting for execution service to shutdown.");
        }
        System.out.println("termino");

    }

}

The pom.xml dependencies:

<dependencies>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-core</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-direct-java</artifactId>
        </dependency>

        <dependency>
            <groupId>com.google.cloud.dataflow</groupId>
            <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
            <version>[2.4.0, 2.99)</version>
        </dependency>

        <!-- slf4j API frontend binding with JUL backend -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.14</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jdk14</artifactId>
            <version>1.7.14</version>
        </dependency>

        <dependency>
            <groupId>com.google.apis</groupId>
            <artifactId>google-api-services-bigquery</artifactId>
            <!-- bigquery.version should resolve to v2-rev374-1.22.0, matching Beam 2.4 as noted above -->
            <version>${bigquery.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>22.0</version>
        </dependency>

        <dependency>
            <groupId>com.google.oauth-client</groupId>
            <artifactId>google-oauth-client</artifactId>
            <version>1.21.0</version>
        </dependency>
        <dependency>
            <groupId>com.google.http-client</groupId>
            <artifactId>google-http-client-jackson2</artifactId>
            <version>1.21.0</version>
        </dependency>
        <dependency>
            <groupId>com.google.oauth-client</groupId>
            <artifactId>google-oauth-client-jetty</artifactId>
            <version>1.21.0</version>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.7</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.google.truth</groupId>
            <artifactId>truth</artifactId>
            <version>0.29</version>
            <scope>test</scope>
        </dependency>

</dependencies>

Does Dataflow need a GOOGLE_APPLICATION_CREDENTIALS JSON file to execute this code?

I can't find a method in BigQuery IO that lists the tables of an existing dataset. Reference post

Could you help me, please?

Comments:

  • Just wondering, have you considered copying the tables via the UI? On the BigQuery UI, go to the left menu with all datasets and tables, hover over the table to copy, and you'll see a small button on the right where you can select Copy Table; in the popup menu you can select your destination dataset and proceed. – Carlos
  • I have a multitude of tables, so I must use Dataflow or Google Cloud Storage. I did the migration with Google Cloud Storage, but I wanted to do it with Google Dataflow. – user2291521
  • I am getting a similar issue when pulling 7k records from a table in BigQuery. I see the elements added in input and output increasing until about 3/4 of the way, and then it throws that "Unexpected end of file from server" error and starts over. It does this 4 times, then gives up and errors out. :( – nomadic_squirrel

2 Answers

1 vote

This is a connection error. I suggest first verifying that you're setting your credentials properly, as explained in the other answer, and if you still get this error, do the following:

This error indicates that the TCP socket was closed before the server was able to send a response. Some possible reasons are:

  • The network connection was lost
  • The server decided to close the connection
  • Something between the client and the server terminated the request

Consider using tools like netstat/traceroute/ping to see whether you can find an issue along the route for cases 1 and 3.

If the server is the one closing the connection, then your request may not be correct, or the server may be throttling you to avoid congestion. Those are cases where quick retries don't help (although if the server is having a transient issue at that moment, retrying does help). You can try an exponential back-off and retry strategy, and if that doesn't work, then your request is probably not a correct one.
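
For illustration, a minimal back-off-and-retry sketch (plain Java, not a Dataflow-specific API; callRemoteService() is just a placeholder for whatever request keeps failing):

    // Minimal exponential back-off sketch: retry a flaky call with a doubling delay.
    // callRemoteService() is a placeholder for the request that fails intermittently.
    static void callWithBackoff() throws Exception {
        int maxAttempts = 5;
        long delayMillis = 1000;                    // start with a 1 second pause
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                callRemoteService();                // the operation that may throw SocketException
                return;                             // success, stop retrying
            } catch (java.net.SocketException e) {
                if (attempt == maxAttempts) {
                    throw e;                        // still failing after the last attempt: give up
                }
                Thread.sleep(delayMillis);          // back off before the next try
                delayMillis *= 2;                   // double the delay each time
            }
        }
    }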

0 votes

I posted a comment about the "Unexpected end of file from server" error; I'm not sure what to do about that. But as for your GOOGLE_APPLICATION_CREDENTIALS question: you will need an account to authenticate to Dataflow with. I am using my own account authenticated through gcloud auth login, but you can also use a service account with gcloud auth activate-service-account --key-file=path/to/keyfile.json. You can create a service account through IAM. Details here.
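
If you'd rather point at the key file from code instead of through gcloud, something along these lines should work with the same client classes used in your snippet (a sketch; path/to/keyfile.json is whatever key you downloaded from IAM):

    // Sketch: build the BigQuery client from an explicit service account key file
    // instead of relying on Application Default Credentials.
    // Alternatively: export GOOGLE_APPLICATION_CREDENTIALS=path/to/keyfile.json
    // and keep using GoogleCredential.getApplicationDefault(...).
    HttpTransport transport = new NetHttpTransport();
    JsonFactory jsonFactory = new JacksonFactory();
    GoogleCredential credential = GoogleCredential
            .fromStream(new java.io.FileInputStream("path/to/keyfile.json"), transport, jsonFactory)
            .createScoped(BigqueryScopes.all());
    Bigquery bigquery = new Bigquery.Builder(transport, jsonFactory, credential)
            .setApplicationName("Bigquery Samples")
            .build();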