4
votes

I am coming across a piece of code in Apache Hive like regexp_extract(input, '[0-9]*', 0), Can someone please explain to me what this code does? Thanks

3

3 Answers

21
votes

From the Hive manual DDL, it returns the string extracted using the pattern. e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns bar.

The index parameter is the capture group, which is an integer that can take the following values:

  • 0: the entire match, in my example it would be foothebar
  • 1: the first group, in my example it would be the
  • 2: the second group, in my example it would be bar
  • n: the nth group. If n is bigger than the actual number of groups defined in your regexp, your Hive query will fail.

In your example, regexp_extract(input, '[0-9]*', 0), your are looking for the whole match for your column identified by input and starting with a numerical value.

Here are a few examples:

  • regexp_extract('9eleven', '[0-9]*', 0) -> returns 9
  • regexp_extract('9eleven', '[0-9]*', 1) -> query fails
  • regexp_extract('911test', '[0-9]*', 0) -> returns 911
  • regexp_extract('911test', '[0-9]*', 1) -> query fails
  • regexp_extract('eleven', '[0-9]*', 0) -> returns empty string
  • regexp_extract('test911', '[0-9]*', 0) -> returns empty string
3
votes

The answer above is partially incorrect. regexp_extract('test911', '[0-9]*', 0) will certainly return 911. We never specify the start of the string before [0-9]*

-1
votes

The regular expression regexp_extract(input, '[0-9]*', 0) will give all starting digits in the input string.

eg,

select regexp_extract('442323test41234', '[0-9]*', 0) ==> 442323
select regexp_extract('44test41234', '[0-9]*', 0) ==> 44
select regexp_extract('test41234', '[0-9]*', 0) ==> 1

But it's not the best way to extract starting numbers. The below is more readable format:

select regexp_extract(input, '^([0-9]+).*?$', 1)

The above regular expression says, list all the digits that are starting in the string and ignore the rest.