2 votes

I would like to parse big CSV files in ABAP in the most performant way under the following conditions:

  1. We do not know the structure of the CSV -> the parse result should be a table of string_table or something similar
  2. The parsing should happen in accordance with https://www.rfc-editor.org/rfc/rfc4180 (a small illustrative sample follows this list)
  3. No solution-specific calls
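
For instance, an RFC 4180 file can contain separators, doubled quotes and even line breaks inside quoted fields, and each of those has to end up inside a single field of the result (my own illustrative sample, not real feed data):

id,comment,amount
1,"says ""hello""",10
2,"spans
two lines, with a comma",20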

I found a very nice blog https://blogs.sap.com/2014/09/09/understanding-csv-files-and-their-handling-in-abap/ but it has its shortcomings:

  1. Write your own code - The code example is not sufficient
  2. Read the file using KCD_CSV_FILE_TO_INTERN_CONVERT - solution-specific (not available everywhere) and will dump on fields that are long enough
  3. Use RTTI and dynamic programming along with FM RSDS_CONVERT_CSV - we do not know the structure in advance
  4. Use class CL_RSDA_CSV_CONVERTER - we do not know the structure in advance

I also checked the first available solution on github - https://github.com/thedoginthewok/ZwdCSV . Unfortunately, it has macros in the code (absolutely unacceptable) and also requires you to know the structure in advance.

I also tried using regular expressions to do the job, but on big files this is too slow.

Even though I am extremely annoyed by this fact, I had to create a solution myself (I cannot believe that I actually did it - it should be in the standard...). My first solution was a direct copy-paste of Java code into ABAP (https://mkyong.com/java/how-to-read-and-parse-csv-file-in-java/). Unfortunately, as my other question How to iterate over string characters in ABAP in a performant way? showed, it is not as easy to iterate over a string in ABAP as it is in Java. I then tried a split/count approach, and so far it has the best performance. Does anyone know a better way to achieve this?

REPORT z_csv_test.


CLASS lcl_csv_parser DEFINITION CREATE PRIVATE.

  PUBLIC SECTION.
    TYPES:
      tt_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
    CLASS-METHODS:
      create
        IMPORTING
          !iv_delimiter      TYPE string DEFAULT  '"'
          !iv_separator      TYPE string DEFAULT  ','
          !iv_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf
        RETURNING
          VALUE(r_result)    TYPE REF TO lcl_csv_parser.
    METHODS:
      parse
        IMPORTING
          iv_string       TYPE string
        RETURNING
          VALUE(r_result) TYPE tt_string_matrix,
      constructor
        IMPORTING
          !iv_delimiter      TYPE string
          !iv_separator      TYPE string
          !iv_line_separator TYPE string.
  PROTECTED SECTION.
  PRIVATE SECTION.
    DATA:
      gv_delimiter         TYPE string,
      gv_separator         TYPE string,
      gv_line_separator    TYPE string,
      gv_escaped_delimiter TYPE string.
    METHODS parse_line_to_string_table
      IMPORTING
        iv_line         TYPE string
      RETURNING
        VALUE(r_result) TYPE string_table.
ENDCLASS.

CLASS lcl_csv_parser IMPLEMENTATION.

  METHOD create.
    r_result = NEW #(
      iv_delimiter      = iv_delimiter
      iv_line_separator = CONV #( iv_line_separator )
      iv_separator      = iv_separator  ).
  ENDMETHOD.

  METHOD constructor.
    me->gv_delimiter = iv_delimiter.
    me->gv_separator = iv_separator.
    me->gv_line_separator = iv_line_separator.
    me->gv_escaped_delimiter = |{ iv_delimiter }{ iv_delimiter }|.
  ENDMETHOD.

  METHOD parse.
    "get the lines
    SPLIT iv_string AT me->gv_line_separator INTO TABLE DATA(lt_lines).
    DATA lx_open_line TYPE abap_bool VALUE abap_false.
    DATA lv_current_line TYPE string.

    LOOP AT lt_lines ASSIGNING FIELD-SYMBOL(<ls_line>).

      FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_line> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
      IF ( lv_count MOD 2 )  = 1.
        IF lx_open_line = abap_true.
          lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
          lx_open_line = abap_false.
          APPEND parse_line_to_string_table( lv_current_line ) TO r_result.
        ELSE.
          lv_current_line = <ls_line>.
          lx_open_line = abap_true.
        ENDIF.
      ELSE.
        IF lx_open_line = abap_true.
          lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
        ELSE.
          APPEND parse_line_to_string_table( <ls_line> ) TO r_result.
        ENDIF.

      ENDIF.
    ENDLOOP.

  ENDMETHOD.


  METHOD parse_line_to_string_table.
    SPLIT iv_line AT me->gv_separator INTO TABLE DATA(lt_line).
    DATA lx_open_field TYPE abap_bool VALUE abap_false.
    DATA lv_current_field TYPE string.
    LOOP AT lt_line ASSIGNING FIELD-SYMBOL(<ls_field>).
      FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_field> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
      IF ( lv_count MOD 2 ) = 1.
        IF lx_open_field = abap_true.
          lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
          lx_open_field = abap_false.
          APPEND lv_current_field TO r_result.
        ELSE.
          lv_current_field = <ls_field>.
          lx_open_field = abap_true.
        ENDIF.
      ELSE.
        IF lx_open_field = abap_true.
          lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
        ELSE.
          APPEND <ls_field> TO r_result.
        ENDIF.
      ENDIF.

    ENDLOOP.

    REPLACE ALL OCCURRENCES OF me->gv_escaped_delimiter IN TABLE r_result WITH me->gv_delimiter.

  ENDMETHOD.

ENDCLASS.
CLASS lcl_test_csv_parser DEFINITION
  FINAL
  CREATE PUBLIC .

  PUBLIC SECTION.
    CLASS-METHODS run.
    CLASS-METHODS get_file
      RETURNING VALUE(r_result) TYPE string.



  PROTECTED SECTION.
  PRIVATE SECTION.

ENDCLASS.



CLASS lcl_test_csv_parser IMPLEMENTATION.

  METHOD get_file.
    DATA lv_file_line TYPE string.
    DO 10 TIMES.
      lv_file_line = |"1234,{ cl_abap_char_utilities=>cr_lf }567890",{ lv_file_line }|.
    ENDDO.
    lv_file_line = lv_file_line && cl_abap_char_utilities=>cr_lf.

    DATA(lt_file_as_table) = VALUE string_table(
        FOR i = 1 THEN  i + 1 UNTIL  i = 1000000
            ( lv_file_line ) ).

    CONCATENATE LINES OF lt_file_as_table INTO r_result.
  ENDMETHOD.





  METHOD run.
    DATA lv_prepare_start TYPE timestampl.
    GET TIME STAMP FIELD lv_prepare_start.

    DATA(lv_file) = get_file(  ).

    DATA lv_prepare_end TYPE timestampl.
    GET TIME STAMP FIELD lv_prepare_end.

    WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = lv_prepare_end tstmp2 = lv_prepare_start ) }|.

    DATA lv_parse_start TYPE timestampl.
    GET TIME STAMP FIELD lv_parse_start.
    DATA(lo_parser) = lcl_csv_parser=>create(  ).
    DATA(lt_file) = lo_parser->parse( lv_file  ).
    DATA lv_parse_end TYPE timestampl.
    GET TIME STAMP FIELD lv_parse_end.

    WRITE |Parse took { cl_abap_tstmp=>subtract( tstmp1 = lv_parse_end tstmp2 = lv_parse_start ) }|.


  ENDMETHOD.



ENDCLASS.

START-OF-SELECTION.
  lcl_test_csv_parser=>run( ).
In your "other question", there was no evidence of faster string parsing in ABAP, so a positive answer to this question would mean a CSV parser built into the kernel. You're also looking for something official that works in ABAP 7.50. You didn't mention the possible solutions already discussed: POSIX regex is unstable and slow, PCRE regex (since ABAP 7.55) is more stable and fast, and the RND parser is kernel-based but not official. SAP doesn't offer an official CSV parser in ABAP. You want something very fast, so you should choose a non-ABAP solution. – Sandra Rossi
NB: 1) |{ cl_abap_char_utilities=>cr_lf }| can be written |\r\n|. 2) When extending the string lv_file_line, it performs much better when the variable is placed at the first position (kernel optimization): lv_file_line = |{ lv_file_line }"1234,\r\n567890",| instead of lv_file_line = |"1234,\r\n567890",{ lv_file_line }|. – Sandra Rossi
We have many CSV feeds whose comments are populated by humans, so not every field will contain them, but most likely every line will. Then again, we have some matrix feeds where this is not THAT relevant - but even there each row column may contain them, due to the textual description of dimension elements. – Dmitry Yudin
To be honest, Stack Exchange has so many sites, and sometimes their question scopes overlap. I think this question could belong to both Stack Overflow and Code Review (though the ABAP community on Stack Overflow is a bit more active, given that there are just two [abap] questions on Code Review ...). – Jonas Wilms
By the way, your test data is invalid; the RFC says "The last field in the record MUST NOT be followed by a comma". – Jonas Wilms

1 Answer

0 votes

I'd like to present a different approach that uses find( ) heavily. Compared to your line-based approach it seems to have equivalent performance for unquoted fields, but performs slightly better when quoted fields are present:

performance table

In general, this uses the pattern position = find( off = position + 1 ) to iterate over the string in chunks, and then uses substring to copy ranges into strings. What can be observed here is that in a loop that iterates a million times, every nanosecond saved has an impact on performance, so moving as much work as possible out of the inner loop increases performance significantly. For the "simple" case of 10-digit fields both algorithms perform equally well, whereas for the "longer" 30-digit fields your algorithm becomes comparatively faster. For fields with quotes, the scan & concat approach I've used seems to be faster than the "reconstruct" approach. I guess that although one can achieve small gains through more clever ABAP, further significant optimizations are only possible by utilizing the engine even more.
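
To show the core pattern in isolation (a simplified sketch of the idea only, assuming a single unquoted line - the measured parser below additionally handles quotes and CRLF):

" Tokenize a comma-separated string by jumping from separator to separator
" with find/substring instead of touching individual characters.
DATA(input)  = `a,b,c,d`.
DATA(fields) = VALUE string_table( ).
DATA(offset) = 0.
WHILE offset <= strlen( input ).
  " locate the next separator, or pretend there is one at the end of the string
  DATA(next) = find( val = input off = offset sub = `,` ).
  IF next = -1.
    next = strlen( input ).
  ENDIF.
  " copy the chunk between the current position and the separator
  APPEND substring( val = input off = offset len = next - offset ) TO fields.
  offset = next + 1.
ENDWHILE.
" fields now contains `a`, `b`, `c`, `d`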

Anyway, here's the algorithm:

CLASS lcl_csv_parser_find IMPLEMENTATION.
  METHOD parse.
    DATA line TYPE string_table.
    DATA position TYPE i.
    DATA(string_length) = strlen( i_string ).

    " Dereferencing member fields is slightly slower than variable access, in a close loop this matters
    DATA(separators) = me->separators.
    DATA(delimiter)  = me->delimiter.

    CHECK string_length <> 0.

    " Checking for delimiters in the DO loop is quite slow. 
    " By scanning the whole file once and skipping that check if no delimiter is present
    " This lead to a slight performance increase of 1s for 1 million rows
    DATA(next_delimiter) = find( val = i_string sub = delimiter ).

    DO.
      DATA(start_position) = position.
      DATA(field) = ``.
      " Check if field is enclosed in double quotes, as we need to unescape then
      IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
         start_position = start_position + 1. " literal starts after opening quote

         DO.
            position = find( val = i_string off = position + 1 sub = delimiter ).
            " literal must be closed
            " ASSERT position <> -1.

            DATA(subliteral_length) = position - start_position.
            field = field && substring( val = i_string off = start_position len = subliteral_length ).

            DATA(following_position) = position + 1.
            IF position = string_length OR i_string+following_position(1) <> delimiter.
              " End of literal is reached
              position = position + 1. " skip closing quote
              EXIT. " DO
            ELSE.
              " Found escape quote instead
              position = following_position + 1.
              field = field && me->delimiter.
              " continue searching
            ENDIF.

            " ASSERT sy-index < 1000.
         ENDDO.
      ELSE.
        " Unescaped field, simply find the ending comma or newline
        position = find_any_of( val = i_string off = position + 1 sub = separators ).

        IF position = -1.
          position = string_length.
        ENDIF.

        field = substring( val = i_string off = start_position len = position - start_position ).
      ENDIF.

      APPEND field TO line.


      " Check if line ended and new line is started
      DATA(current) = substring( val = i_string off = position len = 2 ).
      IF current = me->line_separator.
       APPEND line TO r_result.
       CLEAR line.
       position = position + 2. " skip newline
      ELSE.
        " ASSERT i_string+position(1) = me->separator.
        position = position + 1.
      ENDIF.


      " Check if file ended
      IF position >= string_length.
        RETURN.
      ENDIF.

      " ASSERT sy-index < 100000001.
    ENDDO.

  ENDMETHOD.
ENDCLASS.

As a side note, instead of creating a huge table of string fields as stated in #1, I would experiment with some kind of "visitor pattern", e.g. pass an instance of such an interface to the parser:

INTERFACE if_csv_visitor.
  METHODS begin_line.
  METHODS end_line.
  METHODS visit_field
    IMPORTING
      i_field TYPE string.
ENDINTERFACE.

In a lot of cases you'll write the CSV fields into a structure anyway, so this saves allocating the quite large string matrix.
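
For illustration, a minimal sketch of such a visitor (hypothetical names, not part of the measured report): instead of building the string matrix, it moves every field into the next component of a flat structure and collects the finished rows; the parser would call begin_line / visit_field / end_line where it currently appends to line and r_result.

CLASS lcl_struct_visitor DEFINITION.
  PUBLIC SECTION.
    INTERFACES if_csv_visitor.
    TYPES:
      " hypothetical target line type, purely for illustration
      BEGIN OF t_row,
        field1 TYPE string,
        field2 TYPE string,
        field3 TYPE string,
      END OF t_row,
      t_rows TYPE STANDARD TABLE OF t_row WITH EMPTY KEY.
    DATA rows TYPE t_rows READ-ONLY.
  PRIVATE SECTION.
    DATA current_row    TYPE t_row.
    DATA current_column TYPE i.
ENDCLASS.

CLASS lcl_struct_visitor IMPLEMENTATION.
  METHOD if_csv_visitor~begin_line.
    CLEAR: current_row, current_column.
  ENDMETHOD.

  METHOD if_csv_visitor~visit_field.
    " write the field into the next component of the structure
    current_column = current_column + 1.
    ASSIGN COMPONENT current_column OF STRUCTURE current_row TO FIELD-SYMBOL(<component>).
    IF sy-subrc = 0.
      <component> = i_field.
    ENDIF.
  ENDMETHOD.

  METHOD if_csv_visitor~end_line.
    APPEND current_row TO rows.
  ENDMETHOD.
ENDCLASS.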


And for further reference, here's the whole report:

*&---------------------------------------------------------------------*
*& Report Z_CSV
*&---------------------------------------------------------------------*
*&
*&---------------------------------------------------------------------*
REPORT Z_CSV.

* --------------------- Generic CSV Parser ----------------------------*

CLASS lcl_csv_parser DEFINITION ABSTRACT.

  PUBLIC SECTION.
    TYPES:
      t_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.

    METHODS:
      parse ABSTRACT
        IMPORTING
          i_string       TYPE string
        RETURNING
          VALUE(r_result) TYPE t_string_matrix,
      constructor
        IMPORTING
          i_delimiter      TYPE string DEFAULT  '"'
          i_separator      TYPE string DEFAULT  ','
          i_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf.

  PROTECTED SECTION.
    DATA:
      delimiter         TYPE string,
      separator         TYPE string,
      line_separator    TYPE string,
      escaped_delimiter TYPE string,
      separators        TYPE string.

ENDCLASS.

CLASS lcl_csv_parser IMPLEMENTATION.
  METHOD constructor.
    me->delimiter = i_delimiter.
    me->separator = i_separator.
    me->line_separator = i_line_separator.
    me->escaped_delimiter = |{ i_delimiter }{ i_delimiter }|.
    me->separators = i_separator && i_line_separator.
  ENDMETHOD.
ENDCLASS.


* --------------------------- Line based CSV Parser ------------------------ *

CLASS lcl_csv_parser_line DEFINITION INHERITING FROM lcl_csv_parser.
  PUBLIC SECTION.
    METHODS parse REDEFINITION.

  PRIVATE SECTION.
    METHODS parse_line_to_string_table
      IMPORTING
        i_line         TYPE string
      RETURNING
        VALUE(r_result) TYPE string_table.
ENDCLASS.


CLASS lcl_csv_parser_line IMPLEMENTATION.
  METHOD parse.
    "get the lines
    SPLIT i_string AT me->line_separator INTO TABLE DATA(lines).
    DATA open_line TYPE abap_bool VALUE abap_false.
    DATA current_line TYPE string.

    LOOP AT lines ASSIGNING FIELD-SYMBOL(<line>).

      FIND ALL OCCURRENCES OF me->delimiter IN <line> IN CHARACTER MODE MATCH COUNT DATA(count).
      IF ( count MOD 2 )  = 1.
        IF open_line = abap_true.
          current_line = |{ current_line }{ me->line_separator }{ <line> }|.
          open_line = abap_false.
          APPEND parse_line_to_string_table( current_line ) TO r_result.
        ELSE.
          current_line = <line>.
          open_line = abap_true.
        ENDIF.
      ELSE.
        IF open_line = abap_true.
          current_line = |{ current_line }{ me->line_separator }{ <line> }|.
        ELSE.
          APPEND parse_line_to_string_table( <line> ) TO r_result.
        ENDIF.

      ENDIF.
    ENDLOOP.

  ENDMETHOD.


  METHOD parse_line_to_string_table.
    SPLIT i_line AT me->separator INTO TABLE DATA(fields).

    DATA open_field TYPE abap_bool VALUE abap_false.
    DATA current_field TYPE string.

    LOOP AT fields ASSIGNING FIELD-SYMBOL(<field>).
      FIND ALL OCCURRENCES OF me->delimiter IN <field> IN CHARACTER MODE MATCH COUNT DATA(count).
      IF ( count MOD 2 ) = 1.
        IF open_field = abap_true.
          current_field = |{ current_field }{ me->separator }{ <field> }|.
          open_field = abap_false.
          APPEND current_field TO r_result.
        ELSE.
          current_field = <field>.
          open_field = abap_true.
        ENDIF.
      ELSE.
        IF open_field = abap_true.
          current_field = |{ current_field }{ me->separator }{ <field> }|.
        ELSE.
          APPEND <field> TO r_result.
        ENDIF.
      ENDIF.

    ENDLOOP.

    REPLACE ALL OCCURRENCES OF me->escaped_delimiter IN TABLE r_result WITH me->delimiter.

  ENDMETHOD.

ENDCLASS.

*--------------- Find based CSV Parser ------------------------------------*

CLASS lcl_csv_parser_find DEFINITION INHERITING FROM lcl_csv_parser.
  PUBLIC SECTION.
    METHODS parse REDEFINITION.

ENDCLASS.

CLASS lcl_csv_parser_find IMPLEMENTATION.
  METHOD parse.
    DATA line TYPE string_table.
    DATA position TYPE i.
    DATA(string_length) = strlen( i_string ).

    " Dereferencing member fields is slightly slower than variable access, in a close loop this matters
    DATA(separators) = me->separators.
    DATA(delimiter)  = me->delimiter.

    CHECK string_length <> 0.

    " Checking for delimiters in the DO loop is quite slow.
    " By scanning the whole file once and skipping that check if no delimiter is present
    " This lead to a slight performance increase of 1s for 1 million rows
    DATA(next_delimiter) = find( val = i_string sub = delimiter ).

    DO.
      DATA(start_position) = position.
      DATA(field) = ``.
      " Check if field is enclosed in double quotes, as we need to unescape then
      IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
         start_position = start_position + 1. " literal starts after opening quote

         DO.
            position = find( val = i_string off = position + 1 sub = delimiter ).
            " literal must be closed
            " ASSERT position <> -1.

            DATA(subliteral_length) = position - start_position.
            field = field && substring( val = i_string off = start_position len = subliteral_length ).

            DATA(following_position) = position + 1.
            IF position = string_length OR i_string+following_position(1) <> delimiter.
              " End of literal is reached
              position = position + 1. " skip closing quote
              EXIT. " DO
            ELSE.
              " Found escape quote instead
              position = following_position + 1.
              field = field && me->delimiter.
              " continue searching
            ENDIF.

            " ASSERT sy-index < 1000.
         ENDDO.
      ELSE.
        " Unescaped field, simply find the ending comma or newline
        position = find_any_of( val = i_string off = position + 1 sub = separators ).

        IF position = -1.
          position = string_length.
        ENDIF.

        field = substring( val = i_string off = start_position len = position - start_position ).
      ENDIF.

      APPEND field TO line.


      " Check if line ended and new line is started
      DATA(current) = substring( val = i_string off = position len = 2 ).
      IF current = me->line_separator.
       APPEND line TO r_result.
       CLEAR line.
       position = position + 2. " skip newline
      ELSE.
        " ASSERT i_string+position(1) = me->separator.
        position = position + 1.
      ENDIF.


      " Check if file ended
      IF position >= string_length.
        RETURN.
      ENDIF.

      " ASSERT sy-index < 100000001.
    ENDDO.

  ENDMETHOD.
ENDCLASS.

* -------------------- Tests -------------------------------------------------------- *

CLASS lcl_test_csv_parser DEFINITION
  FINAL
  CREATE PUBLIC .

  PUBLIC SECTION.
    CLASS-METHODS run.
    CLASS-METHODS get_file_complex
      RETURNING VALUE(r_result) TYPE string.
    CLASS-METHODS get_file_simple
      RETURNING VALUE(r_result) TYPE string.
    CLASS-METHODS get_file_long
      RETURNING VALUE(r_result) TYPE string.
    CLASS-METHODS get_file_longer
      RETURNING VALUE(r_result) TYPE string.
    CLASS-METHODS get_file_mixed
      RETURNING VALUE(r_result) TYPE string.



  PROTECTED SECTION.
  PRIVATE SECTION.

ENDCLASS.



CLASS lcl_test_csv_parser IMPLEMENTATION.

  METHOD get_file_complex.
    DATA(file_line) =
      repeat( val = |"1234,{ cl_abap_char_utilities=>cr_lf }7890",| occ = 9 ) &&
      |"1234,{ cl_abap_char_utilities=>cr_lf }7890"| &&
      cl_abap_char_utilities=>cr_lf.

    r_result = repeat( val = file_line occ = 1000000 ).
  ENDMETHOD.

  METHOD get_file_simple.
    DATA(file_line) =
      repeat( val = |1234567890,| occ = 9 ) &&
      |1234567890| &&
      cl_abap_char_utilities=>cr_lf.

    r_result = repeat( val = file_line occ = 1000000 ).
  ENDMETHOD.

  METHOD get_file_long.
    DATA(file_line) =
      repeat( val = |12345678901234567890,| occ = 4 ) &&
      |12345678901234567890| &&
      cl_abap_char_utilities=>cr_lf.

    r_result = repeat( val = file_line occ = 1000000 ).
  ENDMETHOD.

  METHOD get_file_longer.
    DATA(file_line) =
      repeat( val = |1234567890123456789012345678901234567890,| occ = 2 ) &&
      |1234567890123456789012345678901234567890| &&
      cl_abap_char_utilities=>cr_lf.

    r_result = repeat( val = file_line occ = 1000000 ).
  ENDMETHOD.


  METHOD get_file_mixed.
    DATA(file_line) =
      |1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890| &&
      cl_abap_char_utilities=>cr_lf.

    r_result = repeat( val = file_line occ = 1000000 ).
  ENDMETHOD.



  METHOD run.
    DATA prepare_start TYPE timestampl.
    GET TIME STAMP FIELD prepare_start.

    TYPES:
      BEGIN OF t_file,
        name    TYPE string,
        content TYPE string,
      END OF t_file,
      t_files TYPE STANDARD TABLE OF t_file WITH EMPTY KEY.
    DATA(files) = VALUE t_files(
     ( name = `simple`  content = get_file_simple( )  )
     ( name = `long`    content = get_file_long( )    )
     ( name = `longer`  content = get_file_longer( )  )
     ( name = `complex` content = get_file_complex( ) )
     ( name = `mixed`   content = get_file_mixed( )   )
    ).

    DATA prepare_end TYPE timestampl.
    GET TIME STAMP FIELD prepare_end.
    WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = prepare_end tstmp2 = prepare_start ) }|. SKIP 2.

    WRITE: 'File', 15 'Line Parse', 30 'Find Parse', 45 'Match'. NEW-LINE.
    ULINE.

    LOOP AT files INTO DATA(file).

      WRITE file-name UNDER 'File'.
      DATA line_start TYPE timestampl.
      GET TIME STAMP FIELD line_start.

      DATA(line_parser) = NEW lcl_csv_parser_line(  ).
      DATA(line_result) = line_parser->parse( file-content ).

      DATA line_end TYPE timestampl.
      GET TIME STAMP FIELD line_end.
      WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = line_end tstmp2 = line_start ) }s| UNDER 'Line Parse'.


      DATA find_start TYPE timestampl.
      GET TIME STAMP FIELD find_start.

      DATA(find_parser) = NEW lcl_csv_parser_find(  ).
      DATA(find_result) = find_parser->parse( file-content ).

      DATA find_end TYPE timestampl.
      GET TIME STAMP FIELD find_end.
      WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = find_end tstmp2 = find_start ) }s| UNDER 'Find Parse'.

      " WRITE COND #( WHEN line_result = find_result THEN 'yes' ELSE 'no') UNDER 'Match'.
      NEW-LINE.
    ENDLOOP.
  ENDMETHOD.



ENDCLASS.

START-OF-SELECTION.
  lcl_test_csv_parser=>run( ).