I'm currently writing a Java utility to import a few CSV files from GCS into BigQuery. I could easily achieve this with bq load, but I want to do it using a Dataflow job. So I'm using Dataflow's Pipeline and a ParDo transform (which returns a TableRow to feed into BigQueryIO), and I have created a StringToRowConverter() for the transformation. Here the actual problem starts: I am forced to specify the schema for the destination table, even though I don't want to create a new table if it doesn't exist; I only want to load data. And I do not want to set the column names on the TableRow manually, as I have about 600 columns.
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StringToRowConverter extends DoFn<String, TableRow> {
    private static final Logger logger = LoggerFactory.getLogger(StringToRowConverter.class);

    @Override
    public void processElement(ProcessContext c) {
        TableRow row = new TableRow();
        // This is the problem: I don't know which column name to set here.
        row.set("DO NOT KNOW THE COLUMN NAME", c.element());
        c.output(row);
    }
}
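For context, here is roughly how the pipeline is wired up. This is a minimal sketch against the Dataflow Java SDK; the GCS path, the table reference, and the single example field are placeholders I made up, and the withSchema(...) line is exactly what I would have to write out by hand for ~600 columns:

import java.util.Arrays;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class CsvToBigQuery {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder schema: in reality this would need ~600 TableFieldSchema entries.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("col_1").setType("STRING")));

        p.apply(TextIO.Read.from("gs://my-bucket/my-file.csv"))  // placeholder GCS path
         .apply(ParDo.of(new StringToRowConverter()))
         .apply(BigQueryIO.Write
             .to("my-project:my_dataset.my_table")               // placeholder table reference
             .withSchema(schema));                               // the requirement I'd like to drop

        p.run();
    }
}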
Moreover, it can be assumed that the table already exists in the BigQuery dataset, so I don't need to create it, and that the CSV file contains the columns in the correct order.
If there's no workaround for this scenario and the column names are needed for the data load, then I can include them in the first row of the CSV file.
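To make that concrete, here is a rough sketch of what I have in mind if header names turn out to be required. Everything here is hypothetical: the header line would have to be read and handed to the converter somehow (e.g. as a constructor argument), the header row itself would still need to be filtered out of the input, and the naive split(",") ignores quoted fields:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;

public class HeaderAwareRowConverter extends DoFn<String, TableRow> {
    private final String[] columnNames;

    // Assumption: the first row of the CSV has been read beforehand
    // and is passed in at construction time.
    public HeaderAwareRowConverter(String headerLine) {
        this.columnNames = headerLine.split(",");
    }

    @Override
    public void processElement(ProcessContext c) {
        String[] values = c.element().split(",");
        TableRow row = new TableRow();
        // Map each value to the column name at the same position.
        for (int i = 0; i < columnNames.length && i < values.length; i++) {
            row.set(columnNames[i], values[i]);
        }
        c.output(row);
    }
}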
Any help will be appreciated.