I'm running a query to split Splunk results into buckets. I want to group and count files by the size they take up on disk. This can be done with either rangemap or eval case.

From what I read here, eval is faster than rangemap. But the two give me different results.

This is the query I'm running -

<source> 
| eval size_group = case(SizeInMB < 150, "0-150 MB", SizeInMB < 200 AND SizeInMB >= 150, "150-200 MB", SizeInMB < 300 AND SizeInMB >= 200, "200-300 MB", SizeInMB < 500 AND SizeInMB >= 300, "300-500 MB", SizeInMB < 1000 AND SizeInMB >= 500, "500-1000 MB", SizeInMB > 1000, ">1000 MB") 
| stats count by size_group

and this is the result I'm getting -

[screenshot: count by size_group]

With rangemap, this is the query -

<source> 
| rangemap field=SizeInMB "0-150MB"=0-150 "151-200MB"=150-200 "201-300MB"=200-300 "301-500MB"=300-500 "501-999MB"=500-1000 default="1000MB+" 
| stats count by range

I also tried this variant, with labels that match the range boundaries: rangemap field=SizeInMB "0-150MB"=0-150 "150-200MB"=150-200 "200-300MB"=200-300 "300-500MB"=300-500 "500-1000MB"=500-1000 default="1000MB+" and I get the same result -

[screenshot: count by range]

The difference between the two results is not huge, and we can probably live with it, but for the 150-200 MB range it is 445958 vs 445961, for 200-300 MB it is 3676 vs 3677, and for 300-500 MB it is 3346 vs 3348. I want to understand why the counts differ and which result I should trust. Speed-wise eval seems better, but is its data less accurate?

1 Answer

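The problem you're seeing is that your rangemap ranges overlap. Each range is inclusive at both ends, so a value of exactly 150 falls in both 0-150 and 150-200, a value of exactly 200 falls in both 150-200 and 200-300, and so on. As far as I know, rangemap then makes range a multivalue field holding every matching name, so stats count by range counts such an event once under each of them.

With the eval version, the case conditions are mutually exclusive (>= at the lower bound, < at the upper bound), so every value lands in exactly one bucket. If the extra counts come from boundary values, the rangemap numbers should be the higher ones, and the eval numbers are the ones to trust.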
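To make the overlap concrete, here is a small Python model of the two approaches. It is not Splunk itself: it assumes rangemap treats each range as inclusive at both ends and reports every range a value falls in, and the sample sizes are made up for illustration.

```python
from collections import Counter

# Inclusive (low, high) ranges copied from the rangemap call in the question.
RANGES = [
    ("0-150MB", 0, 150),
    ("150-200MB", 150, 200),
    ("200-300MB", 200, 300),
    ("300-500MB", 300, 500),
    ("500-1000MB", 500, 1000),
]
DEFAULT = "1000MB+"


def rangemap_labels(size_mb):
    """Model of rangemap: every range containing the value matches (assumed)."""
    labels = [name for name, low, high in RANGES if low <= size_mb <= high]
    return labels or [DEFAULT]


def case_label(size_mb):
    """Model of eval case(): only the first true condition wins."""
    for name, _low, high in RANGES:
        if size_mb < high:
            return name
    return DEFAULT


# Hypothetical sizes; 150, 200 and 300 sit exactly on a boundary.
sizes = [100, 150, 175, 200, 250, 300, 400, 1200]

rangemap_counts = Counter(label for s in sizes for label in rangemap_labels(s))
case_counts = Counter(case_label(s) for s in sizes)

# 8 events, but the overlapping model produces 11 bucket entries.
print(sum(rangemap_counts.values()), sum(case_counts.values()))  # prints: 11 8
```

In this model the three boundary values are each counted twice by the rangemap-style function, which is the same kind of small surplus you describe.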

Sidebar - you can make that case simpler thusly:

| eval size_group = case(SizeInMB < 150, "0-150 MB", SizeInMB < 200, "150-200 MB", SizeInMB < 300, "200-300 MB", SizeInMB < 500, "300-500 MB", SizeInMB < 1000, "500-1000 MB", 0=0, ">1000 MB") 

Since case stops evaluating at the first condition that matches, there's no need for the AND clauses you had. Using 0=0 as the last condition always evaluates to true, so it acts as a catch-all (think of default in a switch statement in C or C++).
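One side effect worth knowing: the two case versions are not quite identical. Your original ends with SizeInMB > 1000, so a value of exactly 1000 matches no condition, case returns NULL, and stats count by size_group leaves that event out. The catch-all picks it up. A rough Python model of both (the function names are mine, purely for illustration):

```python
def original_case(size_mb):
    """The question's case(): explicit bounds, last branch is > 1000."""
    if size_mb < 150:
        return "0-150 MB"
    if 150 <= size_mb < 200:
        return "150-200 MB"
    if 200 <= size_mb < 300:
        return "200-300 MB"
    if 300 <= size_mb < 500:
        return "300-500 MB"
    if 500 <= size_mb < 1000:
        return "500-1000 MB"
    if size_mb > 1000:
        return ">1000 MB"
    return None  # no branch matched; Splunk's case() yields NULL here


def simplified_case(size_mb):
    """The simplified case(): upper bounds only, plus a catch-all."""
    bounds = [(150, "0-150 MB"), (200, "150-200 MB"), (300, "200-300 MB"),
              (500, "300-500 MB"), (1000, "500-1000 MB")]
    for upper, label in bounds:
        if size_mb < upper:
            return label
    return ">1000 MB"  # plays the role of the 0=0 branch


# The two versions agree everywhere except at exactly 1000.
for size in (149.9, 150, 999.9, 1000, 1000.1):
    print(size, original_case(size), simplified_case(size))
```

If you would rather keep 1000 out of the top bucket, give it its own condition before the catch-all.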