1
votes

I would like to write a simple Oracle SQL 10g statement to compare the items (strings) in an arbitrarily defined comma separated list to a field containing a list of comma separated strings. The items in the defined list can appear in the field in any order, and exact matches need to be found (not substrings).

I have a working solution using a series of regexp_like() statements entered manually, but I need to hand this off to a client who will be maintaining this moving forward, and would like to be able to just update the comma separated string directly.

I also have some software GUI based limitations on what I can do with Oracle SQL to accomplish this. Specifically I can not use any PL/SQL and this must be written in a single select statement (no temporary tables or anything fun/useful like that.) I've found a number of solutions to what I'm trying to accomplish, but almost all depended on being able to write custom functions.

So, now that backstory/limitations are out of the way, let's get down to the nitty gritty.

Example arbitrary (client-provided) list: ItemA,ItemB,ItemC

Table ITEMS:
Column Items (varchar2 of some arbitrary but sufficient length)

  • ItemA,ItemB,ItemC
  • ItemC,ItemB,ItemA
  • ItemX,ItemC,ItemY,ItemA,ItemB,ItemB
  • ItemX,ItemY,ItemC

I want a single select statement that will basically select all rows where Items contains "ItemA" and "ItemB" and "ItemC", but without having to break that string up manually. In this case, it would match the first, second and third row, but not the fourth row.

(EDIT) I realize this table structure is badly designed. At this time I do not know if we can go back to the client to fix this as the data may be used as is elsewhere already, making a redesign would be costly and time consuming. I'm sure a lot of you are used to this scenario. The initial system was designed poorly, now I've been brought in to consult on difficulties arising from the poor design. Let's assume it is not possible to normalize this table, and must be used as is.

It is entirely possible that what I would like to do is simply not possible given the limitations of the interface I need to use, but my SQL knowledge is not great enough to determine that.

Thank you all very much for taking the time to read this question. Please let me know if anything is confusing or needs expansion or clarification.

1
You shouldn't be storing comma separated values in a single column in the first place. Can you fix your data model?a_horse_with_no_name
NORMALIZATION, the basic principle of Relational data model. Read about it first. Your design is flawed. The way you display data is NOT the way you store it in database.Lalit Kumar B
Sorry, I should have mentioned that this is poorly designed. I've only recently been brought in to help this client. I've added an edit explicitly acknowledging that this is almost entirely an issue due to poor design. At this time I do not know if it is possible to normalize this table due to time/cost involved.Strahn

1 Answers

3
votes

While I absolutely agree with commentators that the data model is flawed, sometimes you have to work with what you're given. If it is really impossible to change the data model then you can do this, but it isn't entirely pretty, and depends on your 'restrictions' not excluding the use of common table expressions - I've seen tools struggle with those...

with items_cte as (
  select id, regexp_substr(items, '[^,]+', 1, level) as item, level as pos
  from items
  connect by level <= regexp_count(items, '[^,]+')
  and prior id = id
  and prior sys_guid() is not null
),
list_cte as (
  select regexp_substr(:list, '[^,]+', 1, level) as item,
    count(*) over () as list_length
  from dual
  connect by level <= regexp_count(:list, '[^,]+')
)
select i.id, listagg(i.item, ',') within group (order by i.pos) as items
from items_cte i
join list_cte l
on l.item = i.item
group by i.id
having count(distinct i.item) = max(l.list_length)
order by i.id;

        ID ITEMS
---------- --------------------------------------------
         1 ItemA,ItemB,ItemC
         2 ItemC,ItemB,ItemA
         3 ItemC,ItemA,ItemB,ItemB

SQL Fiddle.

This is using two common table expressions (CTEs, also known as subquery factoring). They each split a comma-separated list up into pseudorows. The list breakdown is fairly simple and uses regex functions you seem to be familiar with. The items one is a bit more complicated because the connect by clause doesn't generally work very well with multiple rows. This uses a trick that uses the prior clause with any non-deterministic function - sys_guid() here but you can use others - to stop it cycling and mixing values from different original rows. I've also assumed you have a unique ID column on the table.

The Fiddle shows the two separate split results, as well as the final result from joining those.

Finally listagg is used to put the split values back together in their original orders, and the count(distinct i.item) check only shows results where all the values from the list were matched. The distinct is needed to match your third row, since itemB appears twice.