I would like to parse big CSV files in ABAP in the most performant way under the following conditions:
- We do not know the structure of the CSV->the parse result should be table of string_table or something simular
- The parsing should happen in accordance to https://www.rfc-editor.org/rfc/rfc4180
- No solution specific calls
I found a very nice blog https://blogs.sap.com/2014/09/09/understanding-csv-files-and-their-handling-in-abap/ but it has its shortcoming:
- Write your own code - The code example is not sufficient
- Read the file using KCD_CSV_FILE_TO_INTERN_CONVERT - solution specific (not available everywhere) and will dump on fields that are big enough
- Use RTTI and dynamic programming along with FM RSDS_CONVERT_CSV - we do not know the structure in advance
- Use class CL_RSDA_CSV_CONVERTER - we do not know the structure in advance
I also checked the first available solution on github - https://github.com/thedoginthewok/ZwdCSV . Unfortunately, it has macros in the code (absolutely unacceptable) and also requires you to know the structure in advance.
I also tried to use the regex to do the job, but on big files this is too slow.
Even though I am extremely annoyed by this fact, I had to create a solution myself (I cannot believe that I actually did it - it should be in the standard...) My first solution was a direct copy paste of Java code into ABAP (https://mkyong.com/java/how-to-read-and-parse-csv-file-in-java/). Unfortunately, as my other question How to iterate over string characters in ABAP in performant way? shown, it is not that easy to iterate over string in abap as it is in Java. I then tried a split/count approach and so far it has the best performance. Does anyone knows the better way achieve this?
REPORT z_csv_test.
CLASS lcl_csv_parser DEFINITION CREATE PRIVATE.
PUBLIC SECTION.
TYPES:
tt_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
CLASS-METHODS:
create
IMPORTING
!iv_delimiter TYPE string DEFAULT '"'
!iv_separator TYPE string DEFAULT ','
!iv_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf
RETURNING
VALUE(r_result) TYPE REF TO lcl_csv_parser.
METHODS:
parse
IMPORTING
iv_string TYPE string
RETURNING
VALUE(r_result) TYPE tt_string_matrix,
constructor
IMPORTING
!iv_delimiter TYPE string
!iv_separator TYPE string
!iv_line_separator TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
DATA:
gv_delimiter TYPE string,
gv_separator TYPE string,
gv_line_separator TYPE string,
gv_escaped_delimiter TYPE string.
METHODS parse_line_to_string_table
IMPORTING
iv_line TYPE string
RETURNING
VALUE(r_result) TYPE string_table.
ENDCLASS.
CLASS lcl_csv_parser IMPLEMENTATION.
METHOD create.
r_result = NEW #(
iv_delimiter = iv_delimiter
iv_line_separator = CONV #( iv_line_separator )
iv_separator = iv_separator ).
ENDMETHOD.
METHOD constructor.
me->gv_delimiter = iv_delimiter.
me->gv_separator = iv_separator.
me->gv_line_separator = iv_line_separator.
me->gv_escaped_delimiter = |{ iv_delimiter }{ iv_delimiter }|.
ENDMETHOD.
METHOD parse.
"get the lines
SPLIT iv_string AT me->gv_line_separator INTO TABLE DATA(lt_lines).
DATA lx_open_line TYPE abap_bool VALUE abap_false.
DATA lv_current_line TYPE string.
LOOP AT lt_lines ASSIGNING FIELD-SYMBOL(<ls_line>).
FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_line> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
IF ( lv_count MOD 2 ) = 1.
IF lx_open_line = abap_true.
lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
lx_open_line = abap_false.
APPEND parse_line_to_string_table( lv_current_line ) TO r_result.
ELSE.
lv_current_line = <ls_line>.
lx_open_line = abap_true.
ENDIF.
ELSE.
IF lx_open_line = abap_true.
lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
ELSE.
APPEND parse_line_to_string_table( <ls_line> ) TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
ENDMETHOD.
METHOD parse_line_to_string_table.
SPLIT iv_line AT me->gv_separator INTO TABLE DATA(lt_line).
DATA lx_open_field TYPE abap_bool VALUE abap_false.
DATA lv_current_field TYPE string.
LOOP AT lt_line ASSIGNING FIELD-SYMBOL(<ls_field>).
FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_field> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
IF ( lv_count MOD 2 ) = 1.
IF lx_open_field = abap_true.
lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
lx_open_field = abap_false.
APPEND lv_current_field TO r_result.
ELSE.
lv_current_field = <ls_field>.
lx_open_field = abap_true.
ENDIF.
ELSE.
IF lx_open_field = abap_true.
lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
ELSE.
APPEND <ls_field> TO r_result.
ENDIF.
ENDIF.
ENDLOOP.
REPLACE ALL OCCURRENCES OF me->gv_escaped_delimiter IN TABLE r_result WITH me->gv_delimiter.
ENDMETHOD.
ENDCLASS.
CLASS lcl_test_csv_parser DEFINITION
FINAL
CREATE PUBLIC .
PUBLIC SECTION.
CLASS-METHODS run.
CLASS-METHODS get_file
RETURNING VALUE(r_result) TYPE string.
PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.
CLASS lcl_test_csv_parser IMPLEMENTATION.
METHOD get_file.
DATA lv_file_line TYPE string.
DO 10 TIMES.
lv_file_line = |"1234,{ cl_abap_char_utilities=>cr_lf }567890",{ lv_file_line }|.
ENDDO.
lv_file_line = lv_file_line && cl_abap_char_utilities=>cr_lf.
DATA(lt_file_as_table) = VALUE string_table(
FOR i = 1 THEN i + 1 UNTIL i = 1000000
( lv_file_line ) ).
CONCATENATE LINES OF lt_file_as_table INTO r_result.
ENDMETHOD.
METHOD run.
DATA lv_prepare_start TYPE timestampl.
GET TIME STAMP FIELD lv_prepare_start.
DATA(lv_file) = get_file( ).
DATA lv_prepare_end TYPE timestampl.
GET TIME STAMP FIELD lv_prepare_end.
WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = lv_prepare_end tstmp2 = lv_prepare_start ) }|.
DATA lv_parse_start TYPE timestampl.
GET TIME STAMP FIELD lv_parse_start.
DATA(lo_parser) = lcl_csv_parser=>create( ).
DATA(lt_file) = lo_parser->parse( lv_file ).
DATA lv_parse_end TYPE timestampl.
GET TIME STAMP FIELD lv_parse_end.
WRITE |Parse took { cl_abap_tstmp=>subtract( tstmp1 = lv_parse_end tstmp2 = lv_parse_start ) }|.
ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
lcl_test_csv_parser=>run( ).
|{ cl_abap_char_utilities=>cr_lf }|
can be written|\r\n|
. 2) When extending the stringlv_file_line
it's much more performing by placing the variable at first position (kernel optimization):lv_file_line = |{ lv_file_line }"1234,\r\n567890",|
(instead oflv_file_line = |"1234,\r\n567890",{ lv_file_line }|
). – Sandra Rossi