
I'm using Azure Data Lake Store with Azure Data Lake Analytics. My data directories are split out by YYYY, MM, DD, with one file per day directory named in the YYYYMMDD_Filename.csv format.

The files themselves are UTF-8 with comma delimiters (no double quotes). They contain a header row, and rows are delimited by CR LF.

I can view the file without issues in Data Lake Preview, and I can run the code below on a single file and it works fine: all rows are fetched and written to a single CSV file.

However, the moment I take the same code and use file set expressions with a virtual column such as Date in order to aggregate more than one file, it gives me a vertex error after it finds all the files and tries to fetch/aggregate the rows in each file it found.

A couple of things to consider:

  • The code syntax passes without error.
  • The code finds all the directories and files fine (job graph shows this with file sizes that match the right files).
  • The code fails when reading the files in the directories (too many columns).
  • All files have been verified and pulled separately with similar code targeting a single file, without using the file set expressions.

Code example

DECLARE @dir string = "/raw/meh/product/test/report/";
DECLARE @file_set_path2 string = @dir + "{Date:yyyy}/{Date:MM}/{Date:dd}/{*}.csv";

@data = 
    EXTRACT Date DateTime, // virtual file set column
            Cat1 string, 
            Cat2 string, 
            Cat3 string, 
            Cat4 string, 
            Cat5 string, 
            Cat6 string, 
            Cat7 string
    FROM @file_set_path2
    USING Extractors.Text(quoting: false, delimiter: ',', skipFirstNRows:1);

@result = 
    SELECT *
    FROM @data
    WHERE Date >= DateTime.Now.AddDays(-2) 
    AND Date <= DateTime.Now.AddDays(-1)
    AND Cat1 != "Sales";

OUTPUT @result
TO "/output/meh/product/test/report/TestData.csv"
USING Outputters.Csv();

Error

  • Row Delimiter: 0x0
  • Column Delimiter: 0x2C
  • TEXT: \r\n2018-01-23....### 20\r\n2018-0

The only thing I can think of here is that the error shows \r\n and quotes the failing line, and the quoted text clearly shows the next row of data. This doesn't make sense because, again, the same code without the file set expressions (i.e. {Date:yyyy}/{Date:MM} and the Date virtual column) does not error at all.
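
For reference, the working single-file version presumably looks something like this (the literal path here is hypothetical, following the YYYYMMDD_Filename.csv pattern described above):

DECLARE @single_file string = "/raw/meh/product/test/report/2018/01/23/20180123_Filename.csv";

@data = 
    EXTRACT Date DateTime, // here Date is a column read from the file itself
            Cat1 string, 
            Cat2 string, 
            Cat3 string, 
            Cat4 string, 
            Cat5 string, 
            Cat6 string, 
            Cat7 string
    FROM @single_file
    USING Extractors.Text(quoting: false, delimiter: ',', skipFirstNRows:1);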

Is Azure Data Lake Analytics converting my UTF-8 CR LF file into something else when I use the file set expressions, merging all my rows onto one single line, and then erroring out because it goes from 8 fields to 10808 fields?

Any way to address this? This is driving me nuts, because every file is clearly accessible in isolation using a similar query, and analyzing the files shows them to be compliant: UTF-8, comma delimited, and all that jive.


2 Answers


I needed to include two dates here: one for the Date field in my data and one for the virtual file set column used in the path expression. In my original script, Date was declared as the virtual column, so the extractor only expected the seven Cat columns in each row; the files actually contain eight columns, which is what produced the "too many columns" vertex error.

DECLARE @dir string = "/raw/meh/product/test/report/";
DECLARE @file_set_path2 string = @dir + "{date:yyyy}/{date:MM}/{date:dd}/{*}.csv";

@data = 
    EXTRACT Date DateTime, // My actual field for date
            Cat1 string, 
            Cat2 string, 
            Cat3 string, 
            Cat4 string, 
            Cat5 string, 
            Cat6 string, 
            Cat7 string,
            date DateTime // virtual file set column (ACTUAL!)
    FROM @file_set_path2
    USING Extractors.Text(quoting: false, delimiter: ',', skipFirstNRows:1);

@result = 
    SELECT *
    FROM @data
    WHERE Date >= DateTime.Now.AddDays(-2) 
    AND Date <= DateTime.Now.AddDays(-1)
    AND Cat1 != "Sales";

OUTPUT @result
TO "/output/meh/product/test/report/TestData.csv"
USING Outputters.Csv();
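
Note that Date and date here differ only by case; U-SQL identifiers are case-sensitive, so this works, though a more distinct name for the virtual column would be less confusing.

A possible refinement (a sketch, assuming the directory date always matches the in-file Date): filtering on the virtual column instead lets the compiler prune the file set, so only the matching day directories are read at all. Expressions like DateTime.Now.AddDays(-2) can be constant-folded at compile time, which is what makes this partition elimination possible.

@result = 
    SELECT *
    FROM @data
    WHERE date >= DateTime.Now.AddDays(-2) // virtual column predicate: pruned at compile time
    AND date <= DateTime.Now.AddDays(-1)
    AND Cat1 != "Sales";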

That indeed looks suspicious.

The TEXT field shows you where it found the "wrong column" that it was not expecting. Is that part of the data or the header of the file? Does this file work if it is the only file?
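
One quick way to check (a sketch; the day here is taken from the error text and is hypothetical): pin the file set path to the suspect day and run the same extract against just that directory.

DECLARE @one_day string = "/raw/meh/product/test/report/2018/01/23/{*}.csv";

@probe = 
    EXTRACT Date DateTime, // read Date from the file, no virtual column in play
            Cat1 string, 
            Cat2 string, 
            Cat3 string, 
            Cat4 string, 
            Cat5 string, 
            Cat6 string, 
            Cat7 string
    FROM @one_day
    USING Extractors.Text(quoting: false, delimiter: ',', skipFirstNRows:1);

OUTPUT @probe
TO "/output/meh/product/test/report/ProbeData.csv"
USING Outputters.Csv();

If that also fails, the problem is in the data for that day rather than in the file set expression.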