
I have a few hundred files representing 400 GB of data in CSV format, with the specifications below:

  • enclosure: double quote
  • separator: comma
  • escape character: backslash

My data could look like this:

a,30,"product, A","my product : \"good product\""

I think BigQuery evaluates the data as:

col 1: a
col 2: 30
col 3: product
col 4: A
col 5: my product : "good product"

and I want:

col 1: a
col 2: 30
col 3: product, A
col 4: my product : "good product"

Is it possible to load this kind of file without using Dataflow or Dataprep? Here is what I tried:

bq load --noreplace --source_format=CSV --max_bad_records=1000000 --allow_jagged_rows ods.my_file gs://file/file.csv.gz

My data were shifted and BigQuery didn't load some rows. I get errors like:

Error while reading data, error message: Could not parse 'XXX' as int for field (position 49) starting at location 2121

Data between close double quote (") and field separator.

Are these CSVs exported from MySQL? Check medium.com/google-cloud/… - Felipe Hoffa
Thanks Felipe, it's an extract from another system which I can't change. Is it not possible to define the escape character? - Pierre.c
Probably the quickest path will be to use a CLI tool to transform the escaping - Felipe Hoffa
Agree with Felipe. I recently did a similar transformation. This works on Mac, but the syntax is slightly different on Linux: sed -i '' 's/\"\"STRING_WITH_EXTRA_QUOTE_HERE\"/\"STRING_WITH_EXTRA_QUOTE_HERE\"/g' YOUR_FILE.csv - Jas
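
Building on those comments, a rough sketch of a pre-load rewrite for the gzipped files: it converts the backslash-escaped quotes (\") into standard CSV doubled quotes (""), which BigQuery parses natively. This assumes backslashes appear only in front of quotes; file names and the destination path are just taken from the question:

# decompress, rewrite \" as "", recompress
gunzip -c file.csv.gz | sed 's/\\"/""/g' | gzip > file_fixed.csv.gz
# copy the fixed file back to Cloud Storage (hypothetical path)
gsutil cp file_fixed.csv.gz gs://file/file_fixed.csv.gz

After that, the bq load command above should parse the fields without shifting.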

1 Answer


Try the options in the comments, or, to parse inside BigQuery instead, load each row raw and run a UDF like this:

SELECT csv.cols[SAFE_OFFSET(0)] a
  , csv.cols[SAFE_OFFSET(1)] b
  , csv.cols[SAFE_OFFSET(2)] c
  , csv.cols[SAFE_OFFSET(3)] d
  , csv.cols[SAFE_OFFSET(4)] e
  , csv.cols[SAFE_OFFSET(5)] f
FROM (
  SELECT fhoffa.x.parse_csv('hello,"dealing here, with", some \" quoted words, all is fine, good one?,"even a , in between"') csv
)

Ref: https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83
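
To load each row raw in the first place, one option is to disable quoting and use a field delimiter that never occurs in the data, so the whole line lands in a single STRING column. A minimal sketch, assuming the rows contain no tab characters (the raw table name is hypothetical):

bq load --source_format=CSV \
  --field_delimiter='\t' \
  --quote='' \
  ods.my_file_raw \
  gs://file/file.csv.gz \
  line:STRING

Each loaded line can then be passed to the UDF above to split it into columns.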
