0
votes

I am trying to count distinct sessionIds from a measurement. sessionId being a tag, I count the distinct entries in a "parent" query, since distinct() doesn't works on tags. In the subquery, I use a group by sessionId limit 1 to still benefit from the index (if there is a more efficient technique, I have ears wide open but I'd still like to understand what's going on).

I have those two variants:

>  select count(distinct(sessionId)) from (select * from UserSession group by sessionId limit 1)
name: UserSession
time count
---- -----
0    3757
>  select count(sessionId) from (select * from UserSession group by sessionId limit 1)
name: UserSession
time count
---- -----
0    4206

To my understanding, those should return the same number, since group by sessionId limit 1 already returns distinct sessionIds (in the form of groups).

And indeed, if I execute:

select * from UserSession group by sessionId limit 1

I have 3757 results (groups), not 4206.

In fact, as soon as I put this in a subquery and re-select fields in a parent query, some sessionIds have multiple occurrences in the final result. Not always, since there is 17549 rows in total, but some are.

This is the sign that the limit 1 is somewhat working, but some sessionId still get multiple entries when re-selected. Maybe some kind of undefined behaviour?

1

1 Answers

1
votes

I can confirm that I get the same result. In my experience using nested queries does not always deliver what you expect/want.

Depending on how you use this you could retrieve a list of all values for a tag with:

SHOW TAG VALUES FROM UserSession WITH KEY=sessionId

Or to get the cardinality (number of distinct values for a tag):

SHOW TAG VALUES EXACT CARDINALITY FROM UserSession WITH KEY=sessionId.

Which will return a single row with a single column count, containing a number. You can remove the EXACT modifier if you don't need to be exact about the result: SHOW TAG VALUES CARDINALITY on Influx Documentation.