1
votes

I have a large list of strings (contains usernames, approx 350K records). I need to store it sorted by lexicographical ordering, and ought to be able to retrieve member existence* and member likeness** efficiently. Redis sorted sets look like the data type for the job.

However, I seem to be falling at the first hurdle. Specifically, one of my key requirements is to keep different letter cases together as long as they start with the same letter. E.g. both Bender and bender should end up being ordered side by side. However, redis' sorted sets are strict in following lexicographical ordering rules, thus all strings starting with uppercase are by default sorted before all strings starting with lower case (e.g. Z is ordered before a, but after A).

Is there any way I can workaround this and still use redis sorted sets to fulfil my requirements? FYI, I'm using redis version 2.8.4. Thanks in advance.


*member existence: given a username, check whether it already exists in the stored set

**member likeness: given a username, pull up N stored usernames that are most like the given username

1
@thepirat000: this is a great post. Only issue is it utilizes ZRANGEBYLEX which isn't available for redis 2.8.4. I can implement an earlier version of the same: oldblog.antirez.com/post/autocomplete-with-redis.html, except that isn't lexicographically agnostic.Hassan Baig

1 Answers

1
votes

You need to do some special encoding with the names. The following is an example.

Let's suppose the length of all names are less than 100 characters. For each name, do the following steps to encode it:

  1. record the indices of upper case letters with 2 digits: for BeNd, the indices are 00 and 02.
  2. convert upper case letters of the name into lower cases to get a lower case name: from BeNd to bend
  3. append the indices to the lower case name to get the encoded name: from bend to bend0002
  4. add the encoded name into the sorted set: zadd key 0 bend0002

In this way, BeNd and bend should be ordered side by side.

When you want to do the search, use the same encoding method to encode the given name, do the search, and decode the results. Since the encoded name records the indices of upper case letters, you can decode it easily.