0
votes

I am trying to execute a query to select all rows that fall within ranges defined by start and end columns in another table. For example, with pseudo-code, if I had these (very small) tables:

ranges:
    group_id = c("a", "b", "c", "d"),
    start = c(1, 7, 2, 25),
    end = c(5, 23, 7, 29)

positions:
    position = 100 random numbers
    annotation = 100 random strings

I would like to make a query that would return something like:

group_id  position  annotation
a         2         adfkjdas
a         3         sdlfkjasl;kdfj
b         9         sdlfkdj
c         5         wwlekrj
d         27        zxcvzx

Using MariaDB/MySQL, a BETWEEN query will operate row-wise over ranges, so this would work:

SELECT
      ranges.group_id as group_id,
      positions.position as position,
      positions.annotation as annotation
    FROM
      (SELECT * FROM my_ranges) AS ranges, positions
    WHERE
      positions.position BETWEEN ranges.start AND ranges.end

That is, the query acts as if the WHERE clause is actually a series of WHERE clauses joined by "OR", one for each row of the ranges table (e.g. BETWEEN 1 AND 5 OR BETWEEN 7 and 23 OR BETWEEN 2 AND 7 OR BETWEEN 25 AND 29).

It seems like the BETWEEN operator behaves differently in presto, so the same query does not return any results.

In realty, my ranges table has ~20,000 ranges I'd like to query, so joining them by writing OR statements seems prohibitive...

Can anyone here suggest a way to modify this query (or my general approach!) to work with Presto?

(added in response to comment): For more SQL than pseudo-code, I'd like to

use tables like this:
CREATE TABLE IF NOT EXISTS `ranges` (
  `group_id` char,
  `start` int(3),
  `end` int(3)
);

INSERT INTO `ranges` (`group_id`, `start`, `end`) VALUES
  ('a', '2', '5'),
  ('b', '7', '23'),
  ('c', '2', '7'),
  ('d', '25', '29');

CREATE TABLE IF NOT EXISTS `positions` (
  `position` int(3),
  `annotation` varchar(20)
);

INSERT INTO `positions` (`position`, `annotation`) VALUES
  ('2', 'adfkjdas'),
  ('3', 'sdlfkjasl;kdfj'),
  ('5', 'wwlekrj'),
  ('9', 'sdlfkdj'),
  ('27', 'zxcvzx');

And run queries like this:

SELECT
  group_id,
  position,
  annotation
FROM
  ranges, positions
WHERE
  positions.position BETWEEN ranges.start AND ranges.end
1
I'd love to assist, but it's not easy to understand your table data and requirements. If possible, please edit your question and show some rows in the input tables so we can try and run a query ourselves to test your scenario. You could also use SQL Fiddle to create a sample dataset. - John Rotenstein
Thanks, @JohnRotenstein - that's a useful site! - bheavner

1 Answers

2
votes

The following worked for me. I had to workaround the fact that end is a reserved word:

CREATE EXTERNAL TABLE IF NOT EXISTS ranges ( 
  group_id string,
  start_value int,
  end_value int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-bucket/ranges/';

CREATE EXTERNAL TABLE IF NOT EXISTS positions ( 
  position int,
  annotation string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-bucket/positions/';

SELECT
  group_id,
  position,
  annotation
FROM
  ranges, positions
WHERE
  positions.position BETWEEN ranges.start_value AND ranges.end_value;

The ranges and positions directories contained CSV files:

a,2,5
b,7,23
c,2,7
d,25,29

and

2,adfkjdas
3,sdlfkjaslkdfj
5,wwlekrj
9,sdlfkdj
27,zxcvzx