U-SQL Error Extracting from BCP CSV File

Question

I have data that I extracted from SQL Server using BCP the file is ASCII CSV.
Dates in the 2016-03-03T23:00:00 format.

When running the extract I get

Additional information:

{"diagnosticCode":195887127,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_COLUMN_CONVERSION_INVALID_ERROR","message":"Invalid character when attempting to convert column data.","description":"HEX: \"223022\" Invalid character when converting input record.\nPosition: line 1, column 21.","resolution":"Check the input for errors or use \"silent\" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring \"invalid\" rows may influence job results and that types have to be nullable for conversion errors to be ignored.","helpLink":"","details":"============================================================================================\nHEX:5432333B35313B34362D323031362E30332E30335432333B30303B30302D302D352D323031362E30332E30335432333B35313B34392F3536372D302D323031362E30332E3033\n ^\nTEXT:T23:51:46,2016-03-03T23:00:00,0,5,2016-03-03T23:51:49.567,0,2016-03-03\n

How do you handle dates properly on extraction? It's unclear to me why it is spliting in the middle of a date time column.

A sample row looks like

50CA2FBB-95C3-4216-A729-999BE2DB491A,2016-03-03T23:51:49.567,1001464881,1001464795,1001464795,00000000-0000-0000-0000-000000000000,00000000-0000-0000-0000-000000000000,100 ,100 , ,12643,bCAwvRnNVwrKDXKxZkVed2Z1zHY=,o2lsnhueDApmvSbm31mh3aetYnc=,2016-03-03T23:50:46,2016-03-03T23:00:00,2016-03-03T23:51:46,2016-03-03T23:00:00,0,5,2016-03-03T23:51:49.567,0,2016-03-03T00:00:00,2016-03-03T23:59:59,00000000-0000-0000-0000-000000000000

Extract Statement is
@res =
EXTRACT 
        LicenseId Guid,
        EntryDate DateTime,
        UltimateId long,
        SiteId string,
        VirtualId string,
        ProjectId Guid,
        DocumentId Guid,
        MasterId string,
        ProductId string,
        FeatureString string,
        VersionId long,
        ComputerSid string,
        UserSid string,
        AppStartTime DateTime,
        StartHour DateTime,
        AppStopTime DateTime,
        StopHour DateTime,
        GmtDelta int,
        RecordedGmtDelta int,
        LastUpdated DateTime,
        Processed bool,
        StartDate DateTime,
        EndDate DateTime,
        ImsId Guid
FROM @dataFile
USING Extractors.Csv();

Interestingly If I drop out all but two of the dates in the BCP File and modify the Extract statement accordingly... It works. — Marcus.Kellermann

Michael Rys Michael Rys · Accepted Answer · 2016-03-23T01:38:13

The default encoding of the built-in extractors is Encoding.UTF-8. So most likely, the three byte sequence you see is being interpreted as UTF-8 and not ASCII.

If your BCP output really only contains code points in the ASCII range (0-127) (and not ANSI 8 bit characters), you can specify Extractors.Csv(encoding:Encoding.[ASCII]) (note the [] around ASCII to escape them from the reserved keyword rule).

If your data however is containing ANSI range characters, you have to BCP out as either UTF-16 (I don't think BCP supports UTF-8), or convert the result of BCP into UTF-8.

Note that if the file is larger than 250MB, we currently have a bug around the record boundary detection when uploading the file if it is in UTF-16 encoding. Until we have this bug fixed, I suggest you upload the file with UTF-8 encoding in that case.

Also, if you need the full ANSI codepage supported, please vote your support for the user voice item at https://feedback.azure.com/forums/327234-data-lake/suggestions/13077555-add-ansi-code-page-support-for-built-in-extractors and provide the code page that you need to have supported (e.g., Windows-1254 or ISO-Latin-1).

U-SQL Error Extracting from BCP CSV File

1 Answers