1 vote

I have a CSV file which I am trying to process using Azure Data Lake Analytics U-SQL. I am fairly new to U-SQL, so please bear with me. The original file is semi-structured, which I managed to fix using the silent:true flag. Now that it is more structured, I would like to fill the empty cells with the data in the cells above them.

My data looks like this: [screenshot: CSV with empty cells]

My problem lies with the empty cells in the first four columns.

The second row has data which I would like to copy down into the empty cells below it (rows 3-5). The data from row 7 needs to be copied down to row 8, the data from row 9 to be copied down to rows 10-13 and the data from row 14 to be copied to rows 15-18.

This has to be done without changing the values in the 'Amount claimed' column.

Does anyone have any ideas on how to achieve this in U-SQL?

Thank you.

Thanks everyone for the guidance. I am not a developer, so it may take a while to figure it out, but I sincerely appreciate all the help. – Ally

4 Answers

3 votes

U-SQL is primarily a language for processing large, order-agnostic data, so this problem is not a good fit for it:

  • There is no row-ordering criterion in the input data.
  • Recreating the input row order within U-SQL can only be done by restricting extraction to a single vertex, which limits the input data size and is somewhat at odds with using a parallel large-data processing language.

Rowsets - the fundamental U-SQL building blocks - are unordered logical data containers. Thus the order of lines in the original input is lost the moment you read it into a rowset; you have to recreate the order within U-SQL using some ordering key.

Assuming there is such an ordering key,

@data =
    SELECT A,
           LAST_VALUE(Col == "" ? null : Col) OVER (ORDER BY Key ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Col
    FROM @input;

should do it, since LAST_VALUE should ignore nulls. Note: the U-SQL documentation doesn't actually specify whether nulls are ignored; they should be, per general aggregate/windowing-function conventions, but this needs to be verified.

Your data doesn't have an ordering column; to create one, you would need to:

  1. Ensure all data is processed by 1 vertex - [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
  2. Add an ordering column within a custom Extractor.

This may be too complicated for an amount of data that you could just process locally before uploading.
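For local preprocessing, the fill-down can be a few lines of Python. This is a hedged sketch, not part of the original answer: the `forward_fill` helper, the sample data, and the column count are all illustrative; in practice you would read the rows with `csv.reader` and write them back out with `csv.writer`.

```python
def forward_fill(rows, n_cols=4):
    """Copy the last non-empty value down into empty cells for the
    first n_cols columns; all other columns are left untouched."""
    last = [""] * n_cols
    filled = []
    for row in rows:
        row = list(row)
        for i in range(min(n_cols, len(row))):
            if row[i].strip() == "":
                row[i] = last[i]        # fill from the row above
            else:
                last[i] = row[i]        # remember the new value
        filled.append(row)
    return filled

# Illustrative data: the third column (think 'Amount claimed') is never modified.
data = [["A", "x", "10"],
        ["",  "",  "20"],
        ["B", "y", "30"],
        ["",  "",  "40"]]
print(forward_fill(data, n_cols=2))
# [['A', 'x', '10'], ['A', 'x', '20'], ['B', 'y', '30'], ['B', 'y', '40']]
```

This sidesteps the ordering problem entirely, because a local file read preserves line order.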

2 votes

The LAG analytic function provides access to a row at a given physical offset that comes before the current row. Use this analytic function in a SELECT expression to compare values in the current row with values in a previous row.

https://msdn.microsoft.com/en-us/library/azure/mt791650.aspx
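One caveat worth knowing: LAG uses a fixed offset, so `LAG(col, 1)` only reaches one row back and fills only single-row gaps; a run of consecutive empty rows needs a running "last seen" value instead. A small Python sketch of the difference (function names and data are illustrative, not from the answer):

```python
def lag_fill_once(values):
    """Mimics COALESCE(col, LAG(col, 1)): each empty cell looks at the
    ORIGINAL previous row only, so consecutive gaps stay empty."""
    return [v if v is not None else (values[i - 1] if i > 0 else None)
            for i, v in enumerate(values)]

def last_seen_fill(values):
    """Carries the last non-empty value forward, filling runs of any length."""
    last, out = None, []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

print(lag_fill_once(["a", None, None, "b"]))   # ['a', 'a', None, 'b']
print(last_seen_fill(["a", None, None, "b"]))  # ['a', 'a', 'a', 'b']
```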

2 votes

An alternative approach (LAST_VALUE didn't work for me):

If you have a row number or timestamp field, then there is no problem:

@tb1 =
    SELECT *
    FROM (VALUES
             (1, "Noah1"),
             (2, (string) null),
             (3, "Noah3"),
             (5, (string) null),
             (6, (string) null),
             (7, "Noah6"),
             (8, "Noah7")
         ) AS T(Timestamp, a);


@tb1 =
    SELECT Timestamp,
           [a],
           [a] != null && [a] != LEAD([a], 1) OVER(ORDER BY Timestamp ASC) AS aSwitch
    FROM @tb1;

@tb1 =
    SELECT Timestamp,
           [a],
           SUM(aSwitch ? 1 : 0) OVER(ORDER BY Timestamp ASC ROWS UNBOUNDED PRECEDING) AS aGrp
    FROM @tb1;

@tb1 =
    SELECT Timestamp,
           FIRST_VALUE([a]) OVER(PARTITION BY aGrp ORDER BY Timestamp ASC) AS aFilled
    FROM @tb1;

OUTPUT @tb1 TO "/test.csv" USING Outputters.Csv(outputHeader: true);

Result:

"Timestamp","aFilled"
1,"Noah1"
2,"Noah1"
3,"Noah3"
5,"Noah3"
6,"Noah3"
7,"Noah6"
8,"Noah7"
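The three windowing steps above (flag the last row of each value run, turn the flags into group ids with a running sum, then take the first value per group) can be mimicked in plain Python to sanity-check the logic. This is only an illustration of the algorithm, not U-SQL:

```python
def fill_by_groups(rows):
    """rows: ordered (timestamp, value) pairs; value may be None."""
    values = [v for _, v in rows]
    n = len(values)
    # Step 1 (aSwitch): value is non-null and differs from the next row's.
    switch = [values[i] is not None
              and values[i] != (values[i + 1] if i + 1 < n else None)
              for i in range(n)]
    # Step 2 (aGrp): running sum of the flags assigns a group id.
    group, groups = 0, []
    for s in switch:
        group += 1 if s else 0
        groups.append(group)
    # Step 3 (FIRST_VALUE per group): every row takes its group's first value.
    first, out = {}, []
    for (t, v), g in zip(rows, groups):
        first.setdefault(g, v)
        out.append((t, first[g]))
    return out

rows = [(1, "Noah1"), (2, None), (3, "Noah3"), (5, None),
        (6, None), (7, "Noah6"), (8, "Noah7")]
print(fill_by_groups(rows))
# [(1, 'Noah1'), (2, 'Noah1'), (3, 'Noah3'), (5, 'Noah3'),
#  (6, 'Noah3'), (7, 'Noah6'), (8, 'Noah7')]
```

The printed result matches the CSV output shown above.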

But what to do if you don't have such a field? In simple cases, you could use a dummy field:

@tb1 =
    SELECT *
    FROM (VALUES
             ("Noah1"),
             ((string) null),
             ("Noah3"),
             ((string) null),
             ((string) null),
             ("Noah6"),
             ("Noah7")
         ) AS T(a);

@tb1 =
    SELECT 1 AS Timestamp,
           [a]
    FROM @tb1;

0 votes

I think you can solve this by using U-SQL user-defined operators (UDOs). In a UDO you can iterate row by row and, whenever you get a row with empty values, copy the data from the previous row.