I am new to U-SQL in Azure Data Lake Analytics. I want to do what I think is a very simple operation, but I ran into trouble. Basically, I want to create a query that ignores empty strings. Using the expression in SELECT works, but not in the WHERE clause.

Below are the statement I am running and the cryptic error I get.

JOB

@xsel_res_1 =
    EXTRACT
        x_paper_id long,
        x_Rank uint,
        x_doi string,
        x_doc_type string,
        x_paper_title string,
        x_original_title string,
        x_book_title string,
        x_paper_year int,
        x_paper_date DateTime?,
        x_publisher string,
        x_journal_id long?,
        x_conference_series_id long?,
        x_conference_instance_id long?,
        x_volume string,
        x_issue string,
        x_first_page string,
        x_last_page string,
        x_reference_count long,
        x_citation_count long?,
        x_estimated_citation int?
    FROM @"adl://xmag.azuredatalakestore.net/graph/2018-02-02/Papers.txt"
    USING Extractors.Tsv();

@xsel_res_2 =
    SELECT
        x_paper_id AS x_paper_id,
        x_doi.ToLower() AS x_doi,
        x_doi.Length AS x_doi_length
    FROM @xsel_res_1
    WHERE NOT string.IsNullOrEmpty(x_doi);

@xsel_res_3 =
    SELECT *
    FROM @xsel_res_2
    SAMPLE ANY (5);

OUTPUT @xsel_res_3
TO @"/graph/2018-02-02/x_output/x_papers_x6.tsv"
USING Outputters.Tsv();

THE ERROR

Vertex failed
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][1]             with error: Vertex user code error.
VertexFailedFast: Vertex failed with a fail-fast error

E_RUNTIME_USER_EXTRACT_ROW_ERROR: Error occurred while extracting row    after processing 10 record(s) in the vertex' input split. Column index: 5, column name: 'x_original_title'.

E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD:     Invalid character following the ending quote character in a quoted field.

Row selected
Component
RUNTIME
Message
Invalid character following the ending quote character in a quoted field.
Resolution

Column should be fully surrounded with double-quotes and double-quotes within the field escaped as two double-quotes.

Description
Invalid character is detected following the ending quote character in a quoted field. A column delimiter, row delimiter or EOF is expected. This error can occur if double-quotes within the field are not correctly escaped as two double-quotes.
Details

Row Delimiter: 0x0
Column Delimiter: 0x9
HEX: 61 76 6E 69 20 74 65 72 6D 69 6E 20 75 20 70 6F 76 61 6C 6A 73 6B 6F 6A 20 6C 69 73 74 69 6E 69 20 69 20 6E 61 74 70 69 73 75 20 67 20 31 31 38 35 09 22 50 6F 20 6B 6F 6E 63 75 22 ### 20 28 73 74 61 72 69 20 68 72

UPDATE: By the way, the same operation works on other datasets, so as far as I can tell the problem is not the syntax.

// Define schema of file, must map all columns
@searchlog =
    EXTRACT
        UserId int,
        Start DateTime,
        Region string,
        Query string,
        Duration int,
        Urls string,
        ClickedUrls string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@searchlog_1 =
    SELECT *
    FROM @searchlog
    WHERE NOT string.IsNullOrEmpty(ClickedUrls);

OUTPUT @searchlog_1
TO @"/Samples/Output/SearchLog_output_x1.tsv"
USING Outputters.Tsv();

2 Answers

3
votes

This is an unfortunate error display for this case.

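Assuming the text is UTF-8, you can use a site like www.hexutf8.com to convert the hex (the ### appears to mark the position where the extractor gave up) to:

avni termin u povaljskoj listini i natpisu g 1185<TAB>"Po koncu" (stari hr

Here <TAB> stands for byte 09, the column delimiter, so the x_original_title column begins with a " character. The extractor therefore treats it as a quoted field, and the " after koncu is followed by a space instead of a delimiter. In other words, the input row contains " characters that are not properly escaped. Following the Resolution text in the error, the column would have to be fully surrounded with double quotes, with the inner quotes doubled (the closing quote would come at the end of the field, which the truncated dump does not show):

avni termin u povaljskoj listini i natpisu g 1185<TAB>"""Po koncu"" (stari hr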
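
If you would rather not paste data into a website, the same decoding can be done locally. The following is a minimal Python sketch, assuming the dump is UTF-8 and that the ### in the HEX line is only a marker added by the error message (it is split off before decoding). The variable names are made up for illustration.

```python
# Hex dump copied from the "HEX:" line of the error message.
# "###" is not a byte; it is treated here as a position marker.
hex_dump = (
    "61 76 6E 69 20 74 65 72 6D 69 6E 20 75 20 70 6F 76 61 6C 6A 73 6B 6F 6A "
    "20 6C 69 73 74 69 6E 69 20 69 20 6E 61 74 70 69 73 75 20 67 20 31 31 38 "
    "35 09 22 50 6F 20 6B 6F 6E 63 75 22 ### 20 28 73 74 61 72 69 20 68 72"
)

# Split at the marker, then decode each side as UTF-8.
before_hex, after_hex = hex_dump.split("###")
before = bytes.fromhex(before_hex).decode("utf-8")
after = bytes.fromhex(after_hex).decode("utf-8")

# repr() makes the tab (\t) visible instead of printing it as whitespace.
print(repr(before))
print(repr(after))
```

Printing with repr() shows a \t directly before "Po koncu", which means the quote is the first character of a new column, and the text after the marker starts with a space rather than a delimiter.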
3
votes

@Saveenr's answer assumes that the values in your file are all quoted. Alternatively, if they are not quoted (and do not contain your column separator as values), then setting Extractors.Tsv(quoting:false) could help as well.
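
To see why turning quoting off can help, here is a small analogy using Python's csv module. This is not the U-SQL extractor and its rules are not guaranteed to match; it is only a sketch of the same idea on the tail of the failing row, with a hypothetical helper named parse.

```python
import csv

# Tail of the failing row as decoded from the error's hex dump:
# a tab (column delimiter) followed by a field that starts with a quote.
row = 'g 1185\t"Po koncu" (stari hr'


def parse(line, **options):
    """Parse one tab-separated line with Python's csv module (not U-SQL)."""
    return next(csv.reader([line], delimiter="\t", **options))


# Quote-aware, strict parsing: the space after the closing quote is rejected.
try:
    parse(row, strict=True)
    strict_error = None
except csv.Error as exc:
    strict_error = str(exc)

# Quoting disabled: the quote characters are kept as ordinary data.
unquoted = parse(row, quoting=csv.QUOTE_NONE)

print(strict_error)
print(unquoted)
```

With quoting disabled the row splits into two columns and the " characters stay in the value, which is roughly what Extractors.Tsv(quoting:false) is expected to do. The trade-off is the one mentioned above: a value that contains a tab can no longer be protected by quotes.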