Fetch text between delimiter using regex in oracle

votes

I'm getting a text oracle enclosed between delimiters. If possible, please help in creating a Regex for the text. I've an example of text

12322ABCD124A||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!||

Till now I'm only able to fetch:

||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!

using this (\|\|(.*))+([^\|\|]).

But I need this data to be separated from || and then split from !!. After which I need to save it into an array like this:

array[1]= (123,word1 ,word2, word3)

array[2]=(789,word4,word5 , word6)

array[3]=(2345 ,word7,word8, 890)

regexdatabaseoracleoracle11gexpression

Try using REGEXP_REPLACE and match the pattern. - Lalit Kumar B

4 Answers

votes

This one should work:

with v1 as
(
  select '12322ABCD124A||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!||' t from dual
)
select level -1 id, trim(',' from regexp_replace(regexp_substr(t,'[^\|]+',1,level),'!!',',')) array from v1
where level > 1
connect by level <= regexp_count(t,'\|\|');

Output:

        ID ARRAY
---------- --------------------------
         1 123,word1  ,word2, word3
         2 789,word4,word5  , word6
         3 2345  ,word7,word8, 890

And if number of parts is constant (4) and You want them in separate columns:

with v1 as
(
      select '12322ABCD124A||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!||' t from dual
    ), v2 as
    (
    select level -1 id, trim(',' from regexp_replace(regexp_substr(t,'[^\|]+',1,level),'!!',',')) array
    from v1
    where level > 1
    connect by level <= regexp_count(t,'\|\|')
    )
    select id,
    regexp_substr(array,'[^,]+',1,1) val1,
    regexp_substr(array,'[^,]+',1,2) val2,
    regexp_substr(array,'[^,]+',1,3) val3,
    regexp_substr(array,'[^,]+',1,4) val4
    from v2;

Output:

        ID VAL1       VAL2       VAL3       VAL4
---------- ---------- ---------- ---------- ----------
         1 123        word1      word2       word3
         2 789        word4      word5       word6
         3 2345       word7      word8       890

PLSQL STYLE:

declare 
  type t_text_array is table of varchar2(4000);
  v_text_array t_text_array := t_text_array();
  val varchar2(4000);
  cursor c1 is 
  select '12322ABCD124A||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!||' t from dual;
begin
  open c1;
  fetch c1 bulk collect into v_text_array;
  for i in 1..v_text_array.count loop
    for j in 2..regexp_count(v_text_array(i),'\|\|') loop
      val := trim(',' from regexp_replace(regexp_substr(v_text_array(i),'[^\|]+',1,j),'!!',','));
      for k in 1..regexp_count(val,',')+1 loop
        --display to console or further process...
        dbms_output.put_line(regexp_substr(val,'[^,]+',1,k));
      end loop;
    end loop;
  end loop;
end;
/

votes

The below one returns expected results:

with x as 
(select '2322ABCD124A||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!||' str
 from dual),
 y as (
select regexp_substr(str,'[^||]+[!!]*', 1, level) str from x
    where level > 1
    connect by regexp_substr(str, '[^||]+[!!]*', 1, level) is not null
     )   
 select 
     regexp_replace (
      regexp_replace (
         regexp_replace(str, '^!!', '(') ,
          '!!$', ')'),   
          '[ ]*!![ ]*', ',')  str
   from y

votes

You need apply twice the split on delimiter as described here. Finally get the values (word) flat again using LISTAGG and finalize with some string concatenation.

I'm providing a complete example with two input records, so it can scale for any number of your parsed lines.

You may need to adjust the T2table limiting the number of splits. Some special handling is additionally needed if you can have NULL values in your keyword.

The query - commented below

WITH t1 AS
  (SELECT 1 id,
    '12322ABCD124A||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!|| ' col
  FROM dual
  UNION ALL
  SELECT 2 id,
    '22222ACCCC12Y||!!567!!word21  !!word22!! word23!!||!!789!!word24!!word25  !! word26!!||!!2345  !!word27!!word28!! 890!!|| ' col
  FROM dual
  ),
  t2 AS
  (SELECT rownum colnum
  FROM dual
    CONNECT BY level < 10
    /* (max) number of columns */
  ),
  t3 AS
  (SELECT t1.id,
    t2.colnum,
    regexp_substr(t1.col,'[^|]+', 1, t2.colnum) col
  FROM t1,
    t2
  WHERE regexp_substr(t1.col, '[^|]+', 1, t2.colnum) IS NOT NULL
  ),
  first_split AS
  ( SELECT id, colnum, col FROM t3 WHERE col LIKE '%!!%'
  ),
  second_split AS
  (SELECT t1.id,
    t1.colnum linenum,
    t2.colnum,
    regexp_substr(t1.col,'[^!]+', 1, t2.colnum) col
  FROM first_split t1,
    t2
  WHERE regexp_substr(t1.col, '[^!]+', 1, t2.colnum) IS NOT NULL
  ),
  agg_values AS
  (SELECT id,
    linenum,
    LISTAGG(col, ',') WITHIN GROUP (
  ORDER BY colnum) val_lst
  FROM second_split
  GROUP BY id,
    linenum
  )
SELECT id,
  'array['
  || row_number() over (partition BY ID order by linenum)
  || ']= ('
  ||val_lst
  ||')' array_text
FROM agg_values
ORDER BY 1,2

Yields as requested

    ID ARRAY_TEXT   
     1 array[1]= (123, word1, word2, word3)   
     1 array[2]= (789, word4, word5, word6)      
     1 array[3]= (2345, word7, word8, 890)        
     2 array[1]= (567, word21, word22, word23)        
     2 array[2]= (789, word24, word25, word26)         
     2 array[3]= (2345, word27, word28, 890)

This is the result of the first_split query. You break the data in lines.

ID     COLNUM COL
---------- ---------- ------------------------------------------
1          2 !!123!!word1  !!word2!! word3!!
1          3 !!789!!word4!!word5  !! word6!!
1          4 !!2345  !!word7!!word8!! 890!!
2          2 !!567!!word21  !!word22!! word23!!
2          3 !!789!!word24!!word25  !! word26!!
2          4 !!2345  !!word27!!word28!! 890!!

The second_split query breaks the lines in word.

ID    LINENUM     COLNUM COL
---------- ---------- ---------- --------------------------------------------------------------------------------------------------------------------------
1          2          1 123
1          2          2 word1
1          2          3 word2
1          2          4 word3
1          3          1 789
1          3          2 word4
1          3          3 word5
.....

The rest is LISTAGG to get the csv keyword list and a ROW_NUMBER function to get nice sequential array_ids

If you want to extract the values in separate columns use PIVOT instead of LISTAGG. The drawback is that you must adjust the query for the actual number of the values.

WITH t1 AS
  (SELECT 1 id,
    '12322ABCD124A||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!|| ' col
  FROM dual
  UNION ALL
  SELECT 2 id,
    '22222ACCCC12Y||!!567!!word21  !!word22!! word23!!||!!789!!word24!!word25  !! word26!!||!!2345  !!word27!!word28!! 890!!|| ' col
  FROM dual
  ),
  t2 AS
  (SELECT rownum colnum
  FROM dual
    CONNECT BY level < 10
    /* (max) number of columns */
  ),
  t3 AS
  (SELECT t1.id,
    t2.colnum,
    regexp_substr(t1.col,'[^|]+', 1, t2.colnum) col
  FROM t1,
    t2
  WHERE regexp_substr(t1.col, '[^|]+', 1, t2.colnum) IS NOT NULL
  ),
  first_split AS
  ( SELECT id, colnum, col FROM t3 WHERE col LIKE '%!!%'
  ),
  --select * from first_split order by 1,2,3;
  second_split AS
  (SELECT t1.id,
    t1.colnum linenum,
    t2.colnum,
    regexp_substr(t1.col,'[^!]+', 1, t2.colnum) col
  FROM first_split t1,
    t2
  WHERE regexp_substr(t1.col, '[^!]+', 1, t2.colnum) IS NOT NULL
  ),
  pivot_values AS
  (SELECT  *
  FROM second_split PIVOT (MAX(col) col FOR (colnum) IN (1 AS "K1", 2 AS "K2", 3 AS "K3", 4 AS "K4"))
  )
SELECT id,
  row_number() over (partition BY ID order by linenum) AS array_id,
  K1_COL,
  K2_COL,
  K3_COL,
  K4_COL
FROM pivot_values
ORDER BY 1,2;

gives the relational view

ID   ARRAY_ID K1_COL   K2_COL   K3_COL   K4_COL
---------- ---------- -------- -------- -------- --------
1          1 123      word1    word2     word3
1          2 789      word4    word5     word6
1          3 2345     word7    word8     890
2          1 567      word21   word22    word23
2          2 789      word24   word25    word26
2          3 2345     word27   word28    890

votes

Oracle Setup:

CREATE TABLE table_name ( id, value ) AS
SELECT 1, '12322ABCD124A||!!123!!word1  !!word2!! word3!!||!!789!!word4!!word5  !! word6!!||!!2345  !!word7!!word8!! 890!!||' FROM DUAL UNION ALL
SELECT 2, '12322ABCD124A||!!321!!word1a  !!word2a!! word3a!!||!!987!!word4a!!word5a  !! word6a!!||!!5432  !!word7a!!word8a!! 098!!||' FROM DUAL;

Query 1:

SELECT id,
       grp_no,
       CAST(
         MULTISET(
           SELECT  REGEXP_SUBSTR( t.text, '!\s*([^!]+?)\s*!', 1, LEVEL, NULL, 1 )
           FROM    DUAL
           CONNECT BY LEVEL <= REGEXP_COUNT( t.text, '!\s*([^!]+?)\s*!' )
         )
         AS SYS.ODCIVARCHAR2LIST
       ) AS words
FROM   (
  SELECT  id,
          COLUMN_VALUE AS grp_no,
          REGEXP_SUBSTR( value, '\|([^|]+)\|', 1, COLUMN_VALUE, NULL, 1 ) AS text
  FROM    table_name t,
          TABLE(
            CAST(
              MULTISET(
                SELECT  LEVEL
                FROM    DUAL
                CONNECT BY LEVEL <= REGEXP_COUNT( t.value, '\|([^|]+)\|' )
              )
              AS SYS.ODCINUMBERLIST
            )
          )
) t;

Results:

       ID     GRP_NO WORDS
---------- ---------- --------------------------------------------------------
         1          1 SYS.ODCIVARCHAR2LIST('123','word1','word2','word3')
         1          2 SYS.ODCIVARCHAR2LIST('789','word4','word5','word6')
         1          3 SYS.ODCIVARCHAR2LIST('2345','word7','word8','890')
         2          1 SYS.ODCIVARCHAR2LIST('321','word1a','word2a','word3a')
         2          2 SYS.ODCIVARCHAR2LIST('987','word4a','word5a','word6a')
         2          3 SYS.ODCIVARCHAR2LIST('5432','word7a','word8a','098')

Query 2:

SELECT id,
       grp_no,
       REGEXP_SUBSTR( t.text, '!\s*([^!]+)!', 1, 1, NULL, 1 ) AS Word1,
       REGEXP_SUBSTR( t.text, '!\s*([^!]+)!', 1, 2, NULL, 1 ) AS Word2,
       REGEXP_SUBSTR( t.text, '!\s*([^!]+)!', 1, 3, NULL, 1 ) AS Word3,
       REGEXP_SUBSTR( t.text, '!\s*([^!]+)!', 1, 4, NULL, 1 ) AS Word4
FROM   (
  SELECT  id,
          COLUMN_VALUE AS grp_no,
          REGEXP_SUBSTR( value, '\|([^|]+)\|', 1, COLUMN_VALUE, NULL, 1 ) AS text
  FROM    table_name t,
          TABLE(
            CAST(
              MULTISET(
                SELECT  LEVEL
                FROM    DUAL
                CONNECT BY LEVEL <= REGEXP_COUNT( t.value, '\|([^|]+)\|' )
              )
              AS SYS.ODCINUMBERLIST
            )
          )
) t;

Results:

ID   GRP_NO WORD1   WORD2   WORD3   WORD4
---- ------ ------- ------- ------- -------
   1      1 123     word1   word2   word3        
   1      2 789     word4   word5   word6 
   1      3 2345    word7   word8   890    
   2      1 321     word1a  word2a  word3a
   2      2 987     word4a  word5a  word6a   
   2      3 5432    word7a  word8a  098