0
votes

I have many files that I'm attempting to join together. I happen to know that many of the columns in each of these files contain nothing but null values and I can do without having them in there. How can I write a U-SQL statement to extract the data from the files, check for columns of nothing but nulls and exclude them?

Thanks!

2

2 Answers

0
votes

The most performing approach would probably be to write a custom extractor that just skips the rows that are just containing null values.

Otherwise, you could write something like this (note the null indicators on the non-object types):

@data = EXTRACT c1 string, c2 int?, c3 DateTime? // ... more columns
        FROM "/path/file.csv"
        USING Extractors.Csv();

@data = SELECT * FROM @data WHERE c1 != null AND c2 != null AND c3 != null;

(note that you will have to most likely cast the null to the column type in the comparison).

If your schema is different between the different files, you could also do the filter using a so-called processor that can look at the schema of the input row. Something along the line of

@data = PROCESS @data PRODUCE c1 string, c2 int?, c3 DateTime? 
        USING new MyAsm.NullFilterProcessor();

Where you would have to implement the NullFilterProcessor as an IProcessor.

0
votes
@data = EXTRACT c1 string, c2 int?, c3 DateTime? // ... more columns
    FROM "/path/file.csv"
    USING Extractors.Csv();

Sometimes the above code will also show error, whenever the null values are already replaced with some other values like "", "\N", null, etc.

USING Extractors.Csv(nullEscape:"\N");

so we have to use nullEscape parameter in the default Extractor in order to exclude the null values in the file.