Working in Oracle 11g and using SQL functions (not PL/SQL, Java, etc.), I would like to extract nested XML data into columns from a CLOB field that is limited at 4000 characters.
Initially I developed a solution that used REGEXP_SUBSTR with CONNECTED BY LEVEL to extract and expand the data. This approach worked on a limited sample set of data, but when applied to a larger data set it took a prohibitively long time (several hours) and didn't return. (I assume this is because it generates a cross join and filters the result afterward?)
Now I have another solution using XMLTYPE and XMLTABLE that works for both the small and large sets, with one important caveat: any invalid XML entries in the data set causes the entire query to fail. In the large data set I have only observed this happening in the case where the LENGTH of the text in the CLOB field is equal to 4000. In those cases, a string exceeding 4000 characters was inserted into the CLOB and was truncated, thereby breaking the XML structure. The XMLTYPE conversion, therefore, fails on those rows and ends the query in an error. I have a crude workaround that uses a WHERE clause to filter out rows where the length of the CLOB >= 4000 to avoid that particular error. But that approach obviously doesn't allow for extraction of the available data in those rows, and, more importantly, hints at the fragility of a solution that relies on the input containing only valid XML. A sample query of this approach is included below and a SQL Fiddle can be found at the following link: http://sqlfiddle.com/#!4/d41d8/40794/0
SELECT
orig_string.message_seq_id,
xml_layer_1.unit_srl,
xml_layer_2.AB_Type
FROM (
SELECT
xml.message_seq_id,
XMLTYPE(
xml.xml_messg_text
) AS xml_messg_text
FROM (
SELECT 1001 AS message_seq_id, RPAD(Q'#<location_data unit_srl="MLRR-1" time_marker="02-04-2015 21:54:42" location_number="2335350"><RunData AB_Type="Rezoned" mr_cod="9982349482-2422422234" Timing="02-04-2015 21:54:42" Handle="L1L_upperband" LongCount="82742288322983299222759200000004" ClassTypeTest="HUCOHOTMMMWMM" Boc="242242" Closeout="LongOp" Founding="14" init_link="lls://rtg.ootir.orzz/prp"/>#', 3888, ' ') || Q'#<RunData AB_Type="Rezoned" mr_cod="9834398399934-39493943-###-09923049,0003-002,12" Timing="02-04-2015 21:54:42"#' AS xml_messg_text, TO_DATE('2015-02-04 21:06:00', 'YYYY-MM-DD HH24:MI:SS') AS create_date FROM DUAL UNION ALL
SELECT 1002 AS message_seq_id, Q'#<location_data unit_srl="UNTOUT-3" time_marker="02-05-2015 10:18:34" location_number="23422"><RunData AB_Type="Rezoned" mr_cod="9234092349,03940388,748209,23" Timing="02-05-2015 10:18:34" Handle="L1L_upperband" ClassTypeTest="URCHTPH" Boc="3245" Closeout="LongOp" Founding="21" init_link="lls://rtg.ootir.orzz/prp"/><RunData AB_Type="Rezoned" mr_cod="93488284983292492394242-####-0200234" Timing="02-05-2015 10:18:34" Handle="L1L_upperband" ClassTypeTest="BERDBP" Boc="6445" Closeout="LongOp" Founding="9453" init_link="lls://rtg.ootir.orzz/prp"/><RunData AB_Type="Lopped" mr_cod="4537747565656565" Timing="02-05-2015 10:18:34" Handle="L1L_upperband" ClassTypeTest="HIUFDDU" Boc="3324" Closeout="LongOp" Founding="29" init_link="lls://rtg.ootir.orzz/prp"/></location_data>#' AS xml_messg_text, TO_DATE('2015-02-05 09:31:23', 'YYYY-MM-DD HH24:MI:SS') AS create_date FROM DUAL UNION ALL
SELECT 1003 AS message_seq_id, Q'#<location_data unit_srl="LURET-22" time_marker="02-04-2015 15:34:55" location_number="1135461"><RunData Timing="02-04-2015 15:34:55" Boc="5644" Closeout="ChpstHousTul" init_link="lls://rtg.ootir.orzz/prp" Dest=""/></location_data>#' AS xml_messg_text, TO_DATE('2015-02-04 16:24:42', 'YYYY-MM-DD HH24:MI:SS') AS create_date FROM DUAL UNION ALL
SELECT 1004 AS message_seq_id, Q'#<location_data unit_srl="GBT-14" time_marker="02-04-2015 15:31:27" location_number="4395822"><RunData AB_Type="Rezoned" mr_cod="5948573853085309485938094853098345" Timing="02-04-2015 15:31:27" Handle="L1L_upperband" ClassTypeTest="BTTXRCH" Boc="232" Closeout="LongOp" Founding="3" init_link="lls://rtg.ootir.orzz/prp"/></location_data>#' AS xml_messg_text, TO_DATE('2015-02-04 16:35:37', 'YYYY-MM-DD HH24:MI:SS') AS create_date FROM DUAL
) xml
WHERE LENGTH(xml.xml_messg_text) < 4000
) orig_string,
XMLTABLE(
'/location_data'
PASSING orig_string.xml_messg_text
COLUMNS
unit_srl VARCHAR2(15) PATH '@unit_srl',
OBJECT XMLTYPE PATH 'RunData'
) xml_layer_1,
XMLTABLE(
'/RunData'
PASSING xml_layer_1.object
COLUMNS
AB_Type VARCHAR2(40) PATH '@AB_Type'
) xml_layer_2
;
Note that message_seq_id = 1001 is the the problem row, and if the WHERE clause is commented out the query results in an error.
Since my access to the database is READ ONLY and I won't be able to add table constraints or otherwise exclude malformed data any solution will need to be able to handle the data as is.
How can I resolve the XMLTYPE conversion problem so that all rows are accessible and don't cause the query to end in errors even if they contain invalid XML?
Is there a more robust approach to extracting the data than either of the two outlined above?