34
votes

I am wondering how to convert comma-delimited values into rows in Redshift. I am afraid that my own solution isn't optimal. Please advise. I have a table where one of the columns contains comma-separated values. For example:

I have:

user_id|user_name|user_action
-----------------------------
1      | Shone   | start,stop,cancell...

I would like to see

user_id|user_name|parsed_action 
------------------------------- 
1      | Shone   | start        
1      | Shone   | stop         
1      | Shone   | cancell      
....

9 Answers

32
votes

A slight improvement over the existing answer is to use a second "numbers" table that enumerates all of the possible list lengths and then use a cross join to make the query more compact.

Redshift does not have a straightforward method for creating a numbers table that I am aware of, but we can use a bit of a hack from https://www.periscope.io/blog/generate-series-in-redshift-and-mysql.html to create one using row numbers.

Specifically, if we assume the number of rows in cmd_logs is larger than the maximum number of commas in the user_action column, we can create a numbers table by counting rows. To start, let's assume there are at most 99 commas in the user_action column:

select 
  (row_number() over (order by true))::int as n
into numbers
from cmd_logs
limit 100;

If we want to get fancy, we can compute the number of commas from the cmd_logs table to create a more precise set of rows in numbers:

select
  n::int
into numbers
from
  (select 
      row_number() over (order by true) as n
   from cmd_logs)
cross join
  (select 
      max(regexp_count(user_action, '[,]')) as max_num 
   from cmd_logs)
where
  n <= max_num + 1;

Once there is a numbers table, we can do:

select
  user_id, 
  user_name, 
  split_part(user_action,',',n) as parsed_action 
from
  cmd_logs
cross join
  numbers
where
  split_part(user_action,',',n) is not null
  and split_part(user_action,',',n) != '';
2
votes

Another idea is to transform your CSV string into JSON first, followed by JSON extraction, along the following lines (note the delimiter passed to replace must be the comma):

... '["' || replace( user_action, ',', '", "' ) || '"]' AS replaced

... JSON_EXTRACT_ARRAY_ELEMENT_TEXT(replaced, numbers.i) AS parsed_action

Where "numbers" is the table from the first answer. The advantage of this approach is the ability to use built-in JSON functionality.
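Putting those fragments together, a complete query might look like the following sketch. It assumes the cmd_logs table from the question and the numbers table from the first answer; note that JSON_EXTRACT_ARRAY_ELEMENT_TEXT uses zero-based indexing, hence n - 1:

```sql
-- Sketch: assumes cmd_logs and the numbers table from the first answer.
-- JSON array indexing is zero-based, so we pass n - 1.
with replaced as (
  select
    user_id,
    user_name,
    '["' || replace(user_action, ',', '", "') || '"]' as user_action_json
  from cmd_logs
)
select
  user_id,
  user_name,
  json_extract_array_element_text(user_action_json, numbers.n - 1) as parsed_action
from replaced
cross join numbers
where json_extract_array_element_text(user_action_json, numbers.n - 1) != '';
```

One caveat: this breaks if an action value itself contains a double quote, since the string is concatenated rather than properly JSON-escaped.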

0
votes

You can get the expected result with the following query. I'm using "UNION ALL" to convert columns to rows; note that this handles at most three comma-separated values per row.

select user_id, user_name, split_part(user_action,',',1) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',2) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',3) as parsed_action from cmd_logs
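For rows with fewer than three actions, split_part returns an empty string for the missing positions, so the union above emits blank rows. A sketch of one way to filter those out, wrapping the same union in a subquery:

```sql
-- Sketch: same UNION ALL approach, with empty parts filtered out.
select user_id, user_name, parsed_action
from (
  select user_id, user_name, split_part(user_action, ',', 1) as parsed_action from cmd_logs
  union all
  select user_id, user_name, split_part(user_action, ',', 2) from cmd_logs
  union all
  select user_id, user_name, split_part(user_action, ',', 3) from cmd_logs
) t
where parsed_action != '';
```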
0
votes

Here's my equally-terrible answer.

I have a users table, and then an events table with a column that is just a comma-delimited string of users at said event, e.g.:

event_id | user_ids
1        | 5,18,25,99,105

In this case, I used the LIKE and wildcard functions to build a new table that represents each event-user edge.

SELECT e.event_id, u.id as user_id
FROM events e
LEFT JOIN users u ON e.user_ids like '%' || u.id || '%'

It's not pretty, but I throw it in a WITH clause so that I don't have to run it more than once per query. I'll likely just build an ETL to create that table every night anyway.

Also, this only works if you have a second table that does have one row per unique possibility. If not, you could do LISTAGG to get a single cell with all your values, export that to a CSV and reupload that as a table to help.

Like I said: a terrible, no-good solution.
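One caveat with the LIKE join above: a plain '%' || u.id || '%' pattern can produce false matches, because user id 5 would also match inside "105". A common workaround, sketched here against the same hypothetical events/users schema, is to pad both sides with the delimiter so each id is matched as a whole token:

```sql
-- Sketch: wraps user_ids in commas so ids match only as whole tokens,
-- e.g. ',5,18,25,99,105,' LIKE '%,5,%' but not '%,05,%'.
SELECT e.event_id, u.id AS user_id
FROM events e
LEFT JOIN users u
  ON ',' || e.user_ids || ',' LIKE '%,' || u.id::varchar || ',%'
```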

0
votes

Late to the party, but I got something working (albeit very slow):

with nums as (
  select n::int as n
  from
    (select row_number() over (order by true) as n
     from table_with_enough_rows_to_cover_range)
  cross join
    (select max(json_array_length(json_column)) as max_num
     from table_with_json_column)
  where n <= max_num + 1
)
select *, json_extract_array_element_text(json_column, nums.n - 1) as parsed_json
from nums, table_with_json_column
where json_extract_array_element_text(json_column, nums.n - 1) != ''
  and nums.n <= json_array_length(json_column)

Thanks to Bob Baxley's answer for the inspiration.

0
votes

Just an improvement on the answer above (https://stackoverflow.com/a/31998832/1265306):

Generate the numbers table using the following SQL, from https://discourse.looker.com/t/generating-a-numbers-table-in-mysql-and-redshift/482

SELECT 
  p0.n 
  + p1.n*2 
  + p2.n * POWER(2,2) 
  + p3.n * POWER(2,3)
  + p4.n * POWER(2,4)
  + p5.n * POWER(2,5)
  + p6.n * POWER(2,6)
  + p7.n * POWER(2,7) 
  as number  
INTO numbers
FROM  
  (SELECT 0 as n UNION SELECT 1) p0,  
  (SELECT 0 as n UNION SELECT 1) p1,  
  (SELECT 0 as n UNION SELECT 1) p2, 
  (SELECT 0 as n UNION SELECT 1) p3,
  (SELECT 0 as n UNION SELECT 1) p4,
  (SELECT 0 as n UNION SELECT 1) p5,
  (SELECT 0 as n UNION SELECT 1) p6,
  (SELECT 0 as n UNION SELECT 1) p7
ORDER BY 1
LIMIT 100

The "ORDER BY" is there only in case you want to paste it without the INTO clause and inspect the results.

0
votes

Create a stored procedure that parses the string dynamically and populates a temp table, then select from the temp table.

Here is the code:

CREATE OR REPLACE PROCEDURE public.sp_string_split( "string" character varying )
AS $$
DECLARE
  cnt INTEGER := 1;
  no_of_parts INTEGER := (select REGEXP_COUNT ( string , ','  ));
  sql VARCHAR(MAX) := '';
  item character varying := '';
BEGIN

  -- Create the table, and clear out any rows left over from a
  -- previous call in the same session
  sql := 'CREATE TEMPORARY TABLE IF NOT EXISTS split_table (part VARCHAR(255)) ';
  RAISE NOTICE 'executing sql %', sql ;
  EXECUTE sql;
  EXECUTE 'TRUNCATE TABLE split_table';

  <<simple_loop_exit_continue>>
  LOOP
    item = (select split_part("string",',',cnt));
    RAISE NOTICE 'item %', item ;
    sql := 'INSERT INTO split_table SELECT '''||item||''' ';
    EXECUTE sql;
    cnt = cnt + 1;
    EXIT simple_loop_exit_continue WHEN (cnt >= no_of_parts + 2);
  END LOOP;

END ;
$$ LANGUAGE plpgsql;


Usage example:-

  call public.sp_string_split('john,smith,jones');
select *
from split_table
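One caveat worth noting: because the INSERT statement is built by string concatenation, an item containing a single quote will break it. A hedged sketch of a safer insert line for the loop body, using Redshift's QUOTE_LITERAL function to escape embedded quotes:

```sql
-- Safer variant of the INSERT built inside the loop:
-- quote_literal() escapes any single quotes inside item.
sql := 'INSERT INTO split_table SELECT ' || quote_literal(item);
```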

0
votes

If you know that there are not many actions in your user_action column, you can use recursive sub-querying with union all, thereby avoiding the aux numbers table.

But it requires you to know the number of actions for each user, so either adjust the initial table or make a view or a temporary table for it.

Data preparation

Assuming you have something like this as a table:

create temporary table actions 
(
    user_id varchar,
    user_name varchar,
    user_action varchar
);

I'll insert some values in it:

insert into actions
values (1, 'Shone', 'start,stop,cancel'),
       (2, 'Gregory', 'find,diagnose,taunt'),
       (3, 'Robot', 'kill,destroy');

Here's an additional table holding the action counts:

create temporary table actions_with_counts 
(
    id varchar,
    name varchar,
    num_actions integer,
    actions varchar
);

insert into actions_with_counts (
    select user_id,
           user_name,
           regexp_count(user_action, ',') + 1 as num_actions,
           user_action
    from actions
);

This will be our "input table", and it looks just as you'd expect:

select * from actions_with_counts;
id | name    | num_actions | actions
------------------------------------------
2  | Gregory | 3           | find,diagnose,taunt
3  | Robot   | 2           | kill,destroy
1  | Shone   | 3           | start,stop,cancel

Again, you can adjust the initial table and thereby skip adding the counts as a separate table.

Sub-query to flatten the actions

Here's the unnesting query:

with recursive tmp (user_id, user_name, idx, user_action) as 
(
    select id,
           name,
           1 as idx,
           split_part(actions, ',', 1) as user_action
    from actions_with_counts

    union all

    select user_id,
           user_name,
           idx + 1 as idx,
           split_part(actions, ',', idx + 1)
    from actions_with_counts
         join tmp on actions_with_counts.id = tmp.user_id
    where idx < num_actions
)
select user_id, user_name, user_action as parsed_action
from tmp
order by user_id;

This will create a new row for each action, and the output would look like this:

user_id | user_name | parsed_action
-----------------------------------
1       | Shone     | start
1       | Shone     | stop
1       | Shone     | cancel
2       | Gregory   | find
2       | Gregory   | diagnose
2       | Gregory   | taunt
3       | Robot     | kill
3       | Robot     | destroy
-4
votes

You can try the COPY command to load your file into Redshift tables:

copy table_name from 's3://mybucket/myfolder/my.csv' CREDENTIALS 'aws_access_key_id=my_aws_acc_key;aws_secret_access_key=my_aws_sec_key' delimiter ','

You can use the delimiter ',' option.

For more details on COPY command options, visit this page:

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html