R Regexp - extract number with 5 digits

8

votes

I have a string a like this one:

stundenwerte_FF_00691_19260101_20131231_hist.zip

and would like to extract the 5-digit number "00691" from it.

I tried using gregexpr and regmatches as well as stringr::str_extract, but couldn't figure out the right regexp. I got as far as:

gregexpr("[:digits{5}:]",a)

I expected this to return the 5-digit numbers, and I don't understand how to fix it.
This does not work properly:

m <- gregexpr("[:digits{5}:]",a)
regmatches(a,m)

Thanks for your help in advance!

regex r

This site can help: regex101.com/ – user3969377

11

votes

You could simply use sub to grab the digits; IMO, regmatches is not necessary for this simple case.

x <- 'stundenwerte_FF_00691_19260101_20131231_hist.zip'
sub('\\D*(\\d{5}).*', '\\1', x)
# [1] "00691"

Edit: If you have other strings that contain digits in front, you would need to modify the expression slightly so that the five digits must sit between two underscores:

sub('.*_(\\d{5})_.*', '\\1', x)
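
To see why the modification matters, here is a minimal sketch. The file name x2 is made up for illustration (it is not from the question); it simply puts a few digits in front of the original string.

```r
# x2 is a hypothetical file name with digits in front (not from the question)
x2 <- "2014_stundenwerte_FF_00691_19260101_20131231_hist.zip"

# First pattern: the match can only start at the first "_", so the
# leading "2014" is left in place in front of the captured digits
first <- sub("\\D*(\\d{5}).*", "\\1", x2)
print(first)
# [1] "201400691"

# Second pattern: requires exactly five digits between two underscores
second <- sub(".*_(\\d{5})_.*", "\\1", x2)
print(second)
# [1] "00691"
```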

6

votes

1) sub

sub(".*_(\\d{5})_.*", "\\1", x)
## [1] "00691"

2) gsubfn::strapplyc The regexp can be slightly simplified if we use strapplyc:

library(gsubfn)

strapplyc(x, "_(\\d{5})_", simplify = TRUE)
## [1] "00691"

3) strsplit If we know that it is the third field:

read.table(text = x, sep = "_", colClasses = "character")$V3
## [1] "00691"

3a) or

strsplit(x, "_")[[1]][3]
## [1] "00691"

4

votes

You could try the regex below, which uses negative lookaround assertions. We can't use word boundaries here, as in \\b\\d{5}\\b, because the preceding and following character _ is itself a word character (it is matched by \w), so there is no boundary between it and a digit.

> x <- "stundenwerte_FF_00691_19260101_20131231_hist.zip"
> m <- regexpr("(?<!\\d)\\d{5}(?!\\d)", x, perl=TRUE)
> regmatches(x, m)
[1] "00691"
> m <- gregexpr("(?<!\\d)\\d{5}(?!\\d)", x, perl=TRUE)
> regmatches(x, m)[[1]]
[1] "00691"

Explanation:

(?<!\\d) Negative lookbehind: asserts that the character before the match is not a digit.
\\d{5} Matches exactly 5 digits.
(?!\\d) Negative lookahead: asserts that the character after the match is not a digit.
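
The effect of the lookarounds can be checked by comparing three patterns on the same string. This is a small sketch; the variable names plain, bounded and guarded are only chosen for illustration.

```r
x <- "stundenwerte_FF_00691_19260101_20131231_hist.zip"

# Without lookarounds, \d{5} also matches the first five digits of each date
plain <- regmatches(x, gregexpr("\\d{5}", x))[[1]]
print(plain)
# [1] "00691" "19260" "20131"

# Word boundaries find nothing, because "_" is a word character
bounded <- regmatches(x, gregexpr("\\b\\d{5}\\b", x, perl = TRUE))[[1]]
print(bounded)
# character(0)

# Lookarounds only accept runs of exactly five digits
guarded <- regmatches(x, gregexpr("(?<!\\d)\\d{5}(?!\\d)", x, perl = TRUE))[[1]]
print(guarded)
# [1] "00691"
```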

1

votes

Let the string be:

ss ="stundenwerte_FF_00691_19260101_20131231_hist.zip"

You can split the string and unlist the substrings:

ll = unlist(strsplit(ss,'_'))

Then build a logical index that is TRUE for the substrings that are 5 characters long:

idx = sapply(ll, nchar)==5

And get the ones which are 5 characters long:

ll[idx]
[1] "00691"
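
Note that the length check alone does not verify that the field consists of digits. As a hedged variation (not part of the approach above), the fields could be filtered with grepl instead; the string ss2 below is a made-up example with a 5-letter first field.

```r
# ss2 is a hypothetical name whose first field also has 5 characters
ss2 <- "werte_FF_00691_19260101_20131231_hist.zip"
ll2 <- unlist(strsplit(ss2, "_"))

# Length check alone keeps the 5-letter field as well
by_length <- ll2[nchar(ll2) == 5]
print(by_length)
# [1] "werte" "00691"

# Keep only fields that consist of exactly five digits
by_digits <- ll2[grepl("^\\d{5}$", ll2)]
print(by_digits)
# [1] "00691"
```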
