Trying to parse data out in Oracle using regexp_substr

Question

So I am trying to parse out data from this table that has a single identifier (file_name) and there are different attributes such as first_name, last_name, date_of_birth. The catch here is that the other non-identifier columns (such as first_name, last_name, etc.) have multiple entries delimited by im guessing a chr(10). Some of these fields can also be null (as DOB is shown below) So for example: The original data will have: filename.xml johnjacob schmidtmiller, 1-02-03 i need it to become: filename.xml john schmidt 1-02-03 filename.xml jacob miller null

What i have is as follows:

    select file_name
          , regexp_substr(first_name, '([^chr(10)]*)(chr(10)|$)', 1, level, null, 1) as first_name
         , regexp_substr(last_name, '([^chr(10)]*)(chr(10)|$)', 1, level, null, 1) as last_name
         , regexp_substr(dob, '([^chr(10)]*)(chr(10)|$)', 1, level, null, 1) as dob
     from TABLE
  connect by level <= regexp_count(first_name, chr(10)) + 1
      and prior file_name = file_name
      and prior sys_guid() is not null
    order by file_name

I am currently only returning the first first name / last name / dob etc. for each file name. What am I doing wrong?

Thanks in advance!

mathguy mathguy · Accepted Answer · 2017-11-01T17:51:11

Please find (google) and read the documentation for the REGEXP_SUBSTR function. You will see that it states explicitly that, by default, it treats the beginning and the end of a LINE as the beginning and the end of the whole input string. This default behavior can be overridden, in the fifth argument (which right now is NULL in your code). Namely, change NULL to 'm' (in all the calls to the function) and run again.

From the documentation:

'm' treats the source string as multiple lines. Oracle interprets ^ and $ as the start and end, respectively, of any line anywhere in the source string, rather than only at the start or end of the entire source string. If you omit this parameter, Oracle treats the source string as a single line.

Added: Moreover, when we want chr(10) to represent the line feed character, it should NOT be within quoted substrings.

[^chr(10)]

(within a quoted string) doesn't mean any character OTHER THAN LF (line feed). Instead, it means any character OTHER THAN c, h, r, 1, 0, ( and ).

Rather, the search pattern must alternate between quoted strings and CHR(10) OUTSIDE quotes, using concatenation.

'([^chr(10)]*)(chr(10)|$)'

should, instead, be

'([^'  ||  chr(10)  ||  ']*)('  ||  chr(10)  ||  '|$)'

Note also, importantly, that we can be liberal (flexible) with spaces OUTSIDE quoted strings (for legibility), but not WITHIN the strings. The single-quote after the caret symbol ^, for example, must be IMMEDIATELY after the caret; if we add a space there, "for legibility", the search pattern will become incorrect. The concatenation operators || are outside the quoted fragments, so we can add spaces around them liberally.

Trying to parse data out in Oracle using regexp_substr

1 Answers