1
votes

I have data in Cloud Storage that I want to load into BigQuery to compute statistics. Currently I'm using a JobConfigurationLoad to load a single file. Here is a sample of the code:

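import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.JobReference;
import com.google.api.services.bigquery.model.TableReference;
import com.google.common.collect.Lists;

JobConfigurationLoad jobconfigurationqLoad = new JobConfigurationLoad();
jobconfigurationqLoad.setSkipLeadingRows(1); // The first line holds the column names
jobconfigurationqLoad.setSourceUris(Lists.newArrayList("gs://my_app/folder_name/test_file.csv"));
jobconfigurationqLoad.setWriteDisposition("WRITE_APPEND");
jobconfigurationqLoad.setEncoding(PlatformConstants.DEFAULT_ENCODING);
jobconfigurationqLoad.setCreateDisposition("CREATE_IF_NEEDED");
// tableReference = my table in BigQuery
jobconfigurationqLoad.setDestinationTable(tableReference);
jobconfigurationqLoad.setSchemaInline("field1:STRING,field2:STRING");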

// JobConfiguration
JobConfiguration jobConfiguration = new JobConfiguration();
jobConfiguration.setLoad(jobconfigurationqLoad);

// JobReference
JobReference jobreference = new JobReference();
jobreference.setProjectId(PROJECT_ID);

// Job
Job insertJob = new Job();
insertJob.setConfiguration(jobConfiguration);
insertJob.setJobReference(jobreference);

In "setSourceUris" I wanted to pass only the folder and load all the files in it, but that doesn't seem to work. I saw some documentation in the Google API about listing the contents of a bucket, but not the contents of a single folder inside the bucket. Something similar is in this answer. I'm using GAE with Java.


2 Answers

2
votes

The BigQuery API's sourceUris property requires that you list each source URI separately (it's not possible to provide a single Google Cloud Storage bucket URI).

However, yes, you can use the Google Cloud Storage API to provide a list of object URIs. Provide a prefix parameter to filter the result list.

Note that the maximum number of files you can include in a single load job is 500 (and the maximum amount of data per single load request is 1 TB; see the BigQuery quota page).
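If a folder holds more files than a single load job accepts, the URI list has to be split across several jobs. Below is a minimal sketch of that batching. It assumes the 500-file limit quoted above still applies (check the quota page before relying on it), and `LoadJobBatcher` is a hypothetical helper class for illustration, not part of any Google library:

```java
import java.util.ArrayList;
import java.util.List;

class LoadJobBatcher {
    // Assumption: the per-job file limit quoted in the answer above.
    public static final int MAX_URIS_PER_JOB = 500;

    /** Splits the URIs into consecutive batches of at most maxPerJob entries. */
    public static List<List<String>> batch(List<String> uris, int maxPerJob) {
        if (maxPerJob <= 0) {
            throw new IllegalArgumentException("maxPerJob must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < uris.size(); start += maxPerJob) {
            int end = Math.min(start + maxPerJob, uris.size());
            // Copy the view so each batch is independent of the input list.
            batches.add(new ArrayList<>(uris.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Made-up object names, used only to exercise the batching.
        List<String> uris = new ArrayList<>();
        for (int i = 0; i < 1200; i++) {
            uris.add("gs://my_app/folder_name/file_" + i + ".csv");
        }
        for (List<String> oneJob : batch(uris, MAX_URIS_PER_JOB)) {
            System.out.println(oneJob.size() + " URIs in this load job");
        }
    }
}
```

Each batch would then be passed to setSourceUris on its own JobConfigurationLoad, with one insert job per batch.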

0
votes

The BigQuery API has a property, configuration.load.sourceUris[], which is an array that can contain one or more files. The names must be "fully-qualified names, for example: gs://mybucket/myobject.csv".

For more info, take a look at: https://developers.google.com/bigquery/docs/reference/v2/jobs

So, as Michael said: "use the Google Cloud Storage API to provide a list of object URIs. Provide a prefix parameter to filter the result list."

Then place the file names in the sourceUris array of your job.
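To make that last step concrete, here is a hedged sketch of turning listed object names into fully-qualified URIs. It assumes the object names have already been fetched with the Cloud Storage list call (the listing itself is not shown), and `SourceUriBuilder` and `toSourceUris` are hypothetical names used only for illustration. The prefix check and the skipping of names ending in "/" (zero-length folder placeholder objects that some tools create) are defensive assumptions, not something the API requires:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SourceUriBuilder {
    /**
     * Builds gs:// URIs for the objects under the given prefix.
     * objectNames is assumed to come from a Cloud Storage list request.
     */
    public static List<String> toSourceUris(String bucket, String prefix, List<String> objectNames) {
        List<String> uris = new ArrayList<>();
        for (String name : objectNames) {
            if (!name.startsWith(prefix)) {
                continue; // not in the requested folder
            }
            if (name.endsWith("/")) {
                continue; // folder placeholder object, not a data file
            }
            uris.add("gs://" + bucket + "/" + name);
        }
        return uris;
    }

    public static void main(String[] args) {
        // Made-up listing result, for illustration only.
        List<String> names = Arrays.asList(
                "folder_name/",
                "folder_name/test_file.csv",
                "folder_name/other_file.csv",
                "another_folder/skip.csv");
        System.out.println(toSourceUris("my_app", "folder_name/", names));
    }
}
```

The returned list can be handed to jobconfigurationqLoad.setSourceUris(...) in place of the single hard-coded URI from the question.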