I am coming across a piece of code in Apache Hive like regexp_extract(input, '[0-9]*', 0), Can someone please explain to me what this code does? Thanks
3 Answers
21
votes
From the Hive manual DDL, it returns the string extracted using the pattern. e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
returns bar
.
The index
parameter is the capture group, which is an integer that can take the following values:
- 0: the entire match, in my example it would be
foothebar
- 1: the first group, in my example it would be
the
- 2: the second group, in my example it would be
bar
- n: the nth group. If n is bigger than the actual number of groups defined in your regexp, your Hive query will fail.
In your example, regexp_extract(input, '[0-9]*', 0)
, your are looking for the whole match for your column identified by input
and starting with a numerical value.
Here are a few examples:
regexp_extract('9eleven', '[0-9]*', 0)
-> returns9
regexp_extract('9eleven', '[0-9]*', 1)
-> query failsregexp_extract('911test', '[0-9]*', 0)
-> returns911
regexp_extract('911test', '[0-9]*', 1)
-> query failsregexp_extract('eleven', '[0-9]*', 0)
-> returns empty stringregexp_extract('test911', '[0-9]*', 0)
-> returns empty string
3
votes
-1
votes
The regular expression regexp_extract(input, '[0-9]*', 0)
will give all starting digits in the input string.
eg,
select regexp_extract('442323test41234', '[0-9]*', 0) ==> 442323
select regexp_extract('44test41234', '[0-9]*', 0) ==> 44
select regexp_extract('test41234', '[0-9]*', 0) ==> 1
But it's not the best way to extract starting numbers. The below is more readable format:
select regexp_extract(input, '^([0-9]+).*?$', 1)
The above regular expression says, list all the digits that are starting in the string and ignore the rest.