0
votes

I am trying to write a search that queries our directory server, which runs OpenLDAP.

The users are going to be searching using the first or last name of the person they're interested in.

I found a problem with accented characters (like áéíóú). First and last names are written in Spanish, so while the proper spelling is Pérez, people searching will often type it as Perez, without the accent.

If I use '(cn=*Perez*)' I get only the non-accented results.

If I use '(cn=*Pérez*)' I get only accented results.

If I use '(cn=~Perez)' I get weird results, or at least nothing I can use: the results contain both Perez and Pérez occurrences, but I also get some results that apparently have nothing to do with the query.

In Spanish this happens quite a lot. Be it laziness or whatever you want to call it, the fact is that for this kind of thing people tend NOT to write the accents, because it's assumed all these searches work with both options (I guess since Google allows it, everybody assumes it's supposed to work that way).

Other than updating the database and removing all accents and trimming them on the query... can you think of another solution?

2 Answers

0
votes

You have your ~ and = swapped above: the approximate-match filter should be (cn~=Perez). I still don't know how well that will work, since Soundex-style matching has always been strange.

Since many attributes, including cn, are multi-valued, you could store a second value on the attribute that has the accented characters converted to their base versions. You would still have the original value to go off of when you need it. You could also get fancy and prefix the converted value with a marker, then use a valuesReturnFilter to filter it out of your results.

#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.

Then modify your query to use an OR expression.

(|(cn=Pérez)(cn={stripped}Perez))

And you would include a valuesReturnFilter that looked like

(!(cn={stripped}*))

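See RFC 3876 (http://www.networksorcery.com/enp/rfc/rfc3876.txt) for details. Note that RFC 3876 defines the values return filter as a list of simple filter items with no and/or/not operators, so check whether your server accepts a negated filter like the one above. The method for adding a request control varies by the platform/library you use to access the directory.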
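A minimal sketch of how the stripped value and the OR filter could be generated, using only Python's standard library. The function names are made up for illustration, the {stripped} prefix is the marker from the sample object, and the input is assumed to be already escaped for use in a filter.

```python
import unicodedata

# Marker taken from the sample object; any prefix that cannot
# appear in a real name would do.
STRIPPED_PREFIX = "{stripped}"


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stripped_cn_value(name: str) -> str:
    """Second cn value to store next to the original one."""
    return STRIPPED_PREFIX + strip_accents(name)


def build_or_filter(user_input: str) -> str:
    """OR filter matching the original value or the stripped one.

    Assumption: user_input has already been escaped for filter use.
    """
    return "(|(cn={0})(cn={1}))".format(user_input, stripped_cn_value(user_input))


print(build_or_filter("Pérez"))
print(build_or_filter("Perez"))
```

Both "Pérez" and "Perez" end up querying the same {stripped}Perez value, so either spelling finds the entry. Be aware that this also turns ñ into n, which may or may not be what you want for Spanish names.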

0
votes

Search filters ("queries") are specified by RFC 2254, which has since been obsoleted by RFC 4515.

Encoding: RFC 2254 indirectly requires the parts of a filter to be OCTET STRINGs, i.e. strings of 8-bit bytes: AttributeValue is an OCTET STRING, MatchingRuleId and AttributeDescription are LDAPStrings, and LDAPString is an OCTET STRING.

The standard on escaping: use "\<ASCII HEX NUMBER>" (a backslash followed by two hex digits) to replace special characters (https://www.rfc-editor.org/rfc/rfc4515#page-4, examples at https://www.rfc-editor.org/rfc/rfc4515#page-5). Quote:

The <valueencoding> rule ensures that the entire filter string is a valid UTF-8 string and provides that the octets that represent the ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII 0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits representing the value of the encoded octet.
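As a rough illustration of the rule quoted above, here is a small Python sketch that escapes exactly the five characters the quote lists. The function name is hypothetical; most LDAP client libraries ship their own helper for this, which you should prefer if one is available.

```python
# The five characters the quoted rule says must be escaped.
_SPECIAL = "*()\\\x00"


def escape_filter_value(value: str) -> str:
    """Replace each special character with a backslash and two hex digits."""
    escaped = []
    for ch in value:
        if ch in _SPECIAL:
            escaped.append("\\{:02x}".format(ord(ch)))
        else:
            escaped.append(ch)
    return "".join(escaped)


print(escape_filter_value("Perez (Jr.)*"))
```

Other characters, including accented ones like "é", are left as they are; they only need to be sent as valid UTF-8.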

Additionally, you should probably escape all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a regex replace of non-ASCII characters with wildcards (*) to be sure. This also helps with characters like "é": Pérez becomes P*rez, which matches both Pérez and Perez.
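The wildcard replacement could look something like this in Python. It is only a sketch: the function names are invented, and the value is assumed to have been escaped beforehand so that any asterisk left in it is one we inserted on purpose.

```python
import re


def wildcard_non_ascii(value: str) -> str:
    """Replace every run of non-ASCII characters with a single '*'."""
    return re.sub(r"[^\x00-\x7f]+", "*", value)


def substring_filter(attr: str, value: str) -> str:
    """Build a 'contains' filter, collapsing adjacent wildcards.

    Assumption: value was escaped first, so a literal asterisk
    already appears as \\2a and is not touched here.
    """
    raw = "({}=*{}*)".format(attr, wildcard_non_ascii(value))
    return re.sub(r"\*{2,}", "*", raw)


print(substring_filter("cn", "Pérez"))
print(substring_filter("cn", "Álvaro"))
```

Keep in mind that this only loosens queries typed with the accent. A query typed as Perez contains no non-ASCII characters, so it is passed through unchanged and will still miss Pérez. The wildcard can also match more than one character, so expect some extra hits.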