0
votes

Working in Oracle 11g and using SQL functions (not PL/SQL, Java, etc.), I would like to extract nested XML data into columns from a CLOB field that is limited at 4000 characters.

Initially I developed a solution that used REGEXP_SUBSTR with CONNECTED BY LEVEL to extract and expand the data. This approach worked on a limited sample set of data, but when applied to a larger data set it took a prohibitively long time (several hours) and didn't return. (I assume this is because it generates a cross join and filters the result afterward?)

Now I have another solution using XMLTYPE and XMLTABLE that works for both the small and large sets, with one important caveat: any invalid XML entries in the data set causes the entire query to fail. In the large data set I have only observed this happening in the case where the LENGTH of the text in the CLOB field is equal to 4000. In those cases, a string exceeding 4000 characters was inserted into the CLOB and was truncated, thereby breaking the XML structure. The XMLTYPE conversion, therefore, fails on those rows and ends the query in an error. I have a crude workaround that uses a WHERE clause to filter out rows where the length of the CLOB >= 4000 to avoid that particular error. But that approach obviously doesn't allow for extraction of the available data in those rows, and, more importantly, hints at the fragility of a solution that relies on the input containing only valid XML. A sample query of this approach is included below and a SQL Fiddle can be found at the following link: http://sqlfiddle.com/#!4/d41d8/40794/0

SELECT
            orig_string.message_seq_id,
            xml_layer_1.unit_srl,
            xml_layer_2.AB_Type

FROM        (

    SELECT
                xml.message_seq_id,
                XMLTYPE(
                    xml.xml_messg_text
                ) AS xml_messg_text

    FROM        (

        SELECT 1001 AS message_seq_id, RPAD(Q'#<location_data unit_srl="MLRR-1" time_marker="02-04-2015 21:54:42" location_number="2335350"><RunData AB_Type="Rezoned" mr_cod="9982349482-2422422234" Timing="02-04-2015 21:54:42" Handle="L1L_upperband" LongCount="82742288322983299222759200000004" ClassTypeTest="HUCOHOTMMMWMM" Boc="242242" Closeout="LongOp" Founding="14" init_link="lls://rtg.ootir.orzz/prp"/>#', 3888, ' ') || Q'#<RunData AB_Type="Rezoned" mr_cod="9834398399934-39493943-###-09923049,0003-002,12" Timing="02-04-2015 21:54:42"#' AS xml_messg_text, TO_DATE('2015-02-04 21:06:00', 'YYYY-MM-DD HH24:MI:SS') AS create_date  FROM DUAL   UNION ALL
        SELECT 1002 AS message_seq_id, Q'#<location_data unit_srl="UNTOUT-3" time_marker="02-05-2015 10:18:34" location_number="23422"><RunData AB_Type="Rezoned" mr_cod="9234092349,03940388,748209,23" Timing="02-05-2015 10:18:34" Handle="L1L_upperband" ClassTypeTest="URCHTPH" Boc="3245" Closeout="LongOp" Founding="21" init_link="lls://rtg.ootir.orzz/prp"/><RunData AB_Type="Rezoned" mr_cod="93488284983292492394242-####-0200234" Timing="02-05-2015 10:18:34" Handle="L1L_upperband" ClassTypeTest="BERDBP" Boc="6445" Closeout="LongOp" Founding="9453" init_link="lls://rtg.ootir.orzz/prp"/><RunData AB_Type="Lopped" mr_cod="4537747565656565" Timing="02-05-2015 10:18:34" Handle="L1L_upperband" ClassTypeTest="HIUFDDU" Boc="3324" Closeout="LongOp" Founding="29" init_link="lls://rtg.ootir.orzz/prp"/></location_data>#' AS xml_messg_text, TO_DATE('2015-02-05 09:31:23', 'YYYY-MM-DD HH24:MI:SS') AS create_date  FROM DUAL   UNION ALL
        SELECT 1003 AS message_seq_id, Q'#<location_data unit_srl="LURET-22" time_marker="02-04-2015 15:34:55" location_number="1135461"><RunData Timing="02-04-2015 15:34:55" Boc="5644" Closeout="ChpstHousTul" init_link="lls://rtg.ootir.orzz/prp" Dest=""/></location_data>#' AS xml_messg_text, TO_DATE('2015-02-04 16:24:42', 'YYYY-MM-DD HH24:MI:SS') AS create_date  FROM DUAL   UNION ALL
        SELECT 1004 AS message_seq_id, Q'#<location_data unit_srl="GBT-14" time_marker="02-04-2015 15:31:27" location_number="4395822"><RunData AB_Type="Rezoned" mr_cod="5948573853085309485938094853098345" Timing="02-04-2015 15:31:27" Handle="L1L_upperband" ClassTypeTest="BTTXRCH" Boc="232" Closeout="LongOp" Founding="3" init_link="lls://rtg.ootir.orzz/prp"/></location_data>#' AS xml_messg_text, TO_DATE('2015-02-04 16:35:37', 'YYYY-MM-DD HH24:MI:SS') AS create_date  FROM DUAL

    ) xml

    WHERE       LENGTH(xml.xml_messg_text) < 4000

) orig_string,

XMLTABLE(
    '/location_data'
    PASSING orig_string.xml_messg_text
    COLUMNS
        unit_srl VARCHAR2(15) PATH '@unit_srl',
        OBJECT XMLTYPE PATH 'RunData'
) xml_layer_1,

XMLTABLE(
    '/RunData'
    PASSING xml_layer_1.object
    COLUMNS
        AB_Type VARCHAR2(40) PATH '@AB_Type'
) xml_layer_2
;

Note that message_seq_id = 1001 is the the problem row, and if the WHERE clause is commented out the query results in an error.

Since my access to the database is READ ONLY and I won't be able to add table constraints or otherwise exclude malformed data any solution will need to be able to handle the data as is.

How can I resolve the XMLTYPE conversion problem so that all rows are accessible and don't cause the query to end in errors even if they contain invalid XML?

Is there a more robust approach to extracting the data than either of the two outlined above?

1
Maybe bad data shouldn't get into the database to start with? - OldProgrammer
@OldProgrammer that'd be my preference as well, but that is an aspect of the client system over which I have no control. - Track

1 Answers

0
votes

This isn't an answer to your original question, but it is a response to your comment to @OldProgrammer, only it's too long/complex to fit in a comment.

If you can't get the client system to stop passing in bad data, maybe you can at least put a constraint on the table to prevent the bad data from being added. Also, the resultant errors on the front end might actually force the client system coders to deal with their issue...

Eg:

create table test1 (col1 clob);

alter table test1 add constraint t1_chk check (col1 is null or length(xmltype(col1)) > 0);

-- works
insert into test1 values (null);

-- works
insert into test1 values ('<root><a>fred</a></root>');

-- fails to insert
insert into test1 values ('<root><a>fred</a>');

commit;

drop table test1;