Several source systems whose data I want to process with Azure Data Lake Analytics (ADLA) produce files that contain a carriage return and line feed (CRLF) inside a column value.

This causes EXTRACT in ADLA to fail with the following error:

E_RUNTIME_USER_EXTRACT_UNEXPECTED_ROW_DELIMITER

I am trying to find a configuration that avoids this issue. The documentation for the built-in extractors on Microsoft.com says:

Note that the rowDelimiter character inside a quoted string will not be escaped and will be used as a row separator which will lead to incorrect or failing extractions.

https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/extractor-parameters-u-sql

Unfortunately this fails to mention a good workaround.
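
To make the documented behavior concrete, here is a minimal sketch in Python (not U-SQL) with a made-up two-row sample. It contrasts splitting on every CRLF, which is roughly what the note describes, with a quote-aware parser. The sample data and variable names are assumptions for illustration only.

```python
import csv
import io

# Hypothetical sample: the "comment" column of row 1 contains a CRLF.
data = 'id,comment\r\n1,"first line\r\nsecond line"\r\n2,"plain"\r\n'

# Naive approach: every CRLF ends a row, even inside a quoted string.
naive_rows = [line.split(",") for line in data.split("\r\n") if line]

# Quote-aware approach: a CRLF inside quotes stays part of the field.
# newline="" keeps the original line endings so the parser sees them.
quoted_rows = list(csv.reader(io.StringIO(data, newline="")))

# Column counts per row show where the naive split goes wrong.
naive_widths = [len(row) for row in naive_rows]
quoted_widths = [len(row) for row in quoted_rows]
```

With the naive split, the quoted value is torn into two rows with mismatched column counts, which is the kind of input that makes an extraction fail or return wrong data. The quote-aware parse returns one header row and two data rows, each with two columns.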

I tried switching to another format such as ORC or Parquet. However, these do not seem to be fully supported yet. Because this limits what I can do in the development environment, I would rather not use these formats for now.

This seems like a very common problem, yet I cannot find a good solution. Is there a standard way to work around it while keeping the convenience of storing files as CSV/TSV?

1 Answer

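I've accomplished this by creating a custom extractor based on a third-party CSV parser, specifically the CsvParser class from Josh Close's fantastic CsvHelper library. It works like a charm. Don't forget to set AtomicFileProcessing = true in the extractor's SqlUserDefinedExtractor attribute, so that each file is handed to the parser as a whole rather than split into chunks.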
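
Why whole-file processing matters can be sketched in Python (again, not the actual C# extractor, and not CsvHelper). The sample data, the `parse_csv` helper, and the hand-picked cut position are all assumptions for illustration; ADLA chooses its own split boundaries.

```python
import csv
import io


def parse_csv(text):
    """Quote-aware parse of a CSV text into a list of rows (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Hypothetical sample: the "comment" column of row 1 contains a CRLF.
data = 'id,comment\r\n1,"first line\r\nsecond line"\r\n2,"plain"\r\n'

# Whole file given to a single parser: the quoted CRLF is handled correctly.
whole_rows = parse_csv(data)

# Simulated split: cut the file right after the CRLF inside the quoted
# field, then parse each chunk independently.
cut = data.index("\r\nsecond") + 2
chunks = [data[:cut], data[cut:]]
split_rows = [row for chunk in chunks for row in parse_csv(chunk)]
```

Even a quote-aware parser cannot recover once a chunk boundary falls inside a quoted field, because the second chunk starts without the opening quote. Processing the file atomically avoids that situation.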