0
votes

I get reviews from customers into azure data lake which are combined/formatted in U-SQL. Some of the customers leave a space in between their comment lines. Any idea how to remove these blank lines using U-Sql?

E.g of a single comment

My name is abc
<blank line>
I love playing football. 

Need to remove the blank line in between. Trim() removes the start and end whitespaces only. Thanks.

1
How does the source look like (content) and what format and structure is it anyway?Peter Bons
Its just 2 columns which are customer ID, review. Both are strings.Jobi

1 Answers

0
votes

You can use RegEx and U-SQL to clean up your text, for example, this simple script replaces two carriage returns with a single carriage return, removing blank lines:

@input = 
    SELECT * FROM
        ( VALUES
            ( 1, @"My name is abc

I love playing football." ),
            ( 2, @"I love U-SQL

I'm indifferent to Hadoop." )
        ) AS x( id, review );



// Strip out repeated carriage returns
@output =
    SELECT id, 
            Regex.Replace(review, "(\r\n){2}", "\r\n", RegexOptions.Multiline) AS cleanedReview
    FROM @input;


OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();

You may have to experiment with the RegEx expression to clean your particular data.