Removing null columns with U-SQL

Question

I have many files that I'm attempting to join together. I happen to know that many of the columns in each of these files contain nothing but null values and I can do without having them in there. How can I write a U-SQL statement to extract the data from the files, check for columns of nothing but nulls and exclude them?

Thanks!

Michael Rys Michael Rys · Accepted Answer · 2016-10-10T18:49:52

The most performing approach would probably be to write a custom extractor that just skips the rows that are just containing null values.

Otherwise, you could write something like this (note the null indicators on the non-object types):

@data = EXTRACT c1 string, c2 int?, c3 DateTime? // ... more columns
        FROM "/path/file.csv"
        USING Extractors.Csv();

@data = SELECT * FROM @data WHERE c1 != null AND c2 != null AND c3 != null;

(note that you will have to most likely cast the null to the column type in the comparison).

If your schema is different between the different files, you could also do the filter using a so-called processor that can look at the schema of the input row. Something along the line of

@data = PROCESS @data PRODUCE c1 string, c2 int?, c3 DateTime? 
        USING new MyAsm.NullFilterProcessor();

Where you would have to implement the NullFilterProcessor as an IProcessor.

Removing null columns with U-SQL

2 Answers