I am using bulbs and rexster and am trying to store nodes with unicode properties (see example below). Apparently, creating nodes in the graph works properly as I can see the nodes in the web interface that comes with rexster (Rexster Dog House) but retrieving the same node does not work - all I get is None.

Everything works as expected when I create and look for nodes with non-unicode-specific letters in their properties. E.g. in the following example a node with name = u'University of Cambridge' would be retrievable as expected.

Rexster version:

[INFO] Application - Rexster version [2.4.0]

Example code:

# -*- coding: utf-8 -*-


from bulbs.rexster import Graph
from bulbs.model import Node
from bulbs.property import String
from bulbs.config import DEBUG
import bulbs

class University(Node):
    element_type = 'university'
    name = String(nullable=False, indexed=True)


g = Graph()
g.add_proxy('university', University)
g.config.set_logger(DEBUG)

name = u'Université de Montréal'

g.university.create(name=name)

print g.university.index.lookup(name=name)

print bulbs.__version__

Gives the following output on the command line:

POST url: http://localhost:8182/graphs/emptygraph/tp/gremlin
POST body: {"params": {"keys": null, "index_name": "university", "data": {"element_type": "university", "name": "Universit\u00e9 de Montr\u00e9al"}}, "script": "def createIndexedVertex = {\n vertex = g.addVertex()\n index = g.idx(index_name)\n for (entry in data.entrySet()) {\n if (entry.value == null) continue;\n vertex.setProperty(entry.key,entry.value)\n if (keys == null || keys.contains(entry.key))\n\tindex.put(entry.key,String.valueOf(entry.value),vertex)\n }\n return vertex\n }\n def transaction = { final Closure closure ->\n try {\n results = closure();\n g.commit();\n return results; \n } catch (e) {\n g.rollback();\n throw e;\n }\n }\n return transaction(createIndexedVertex);"}
GET url: http://localhost:8182/graphs/emptygraph/indices/university?value=Universit%C3%A9+de+Montr%C3%A9al&key=name
GET body: None
None
0.3

1 Answer

Ok, I finally got to the bottom of this.

Since TinkerGraph uses a HashMap for its index, you can see what's being stored in the index by using Gremlin to return the contents of the map.

Here's what's being stored in the TinkerGraph index using your Bulbs g.university.create(name=name) method above...

$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="g.idx(\"university\").index"
{"results":[{"name":{"Université de Montréal":[{"name":"Université de Montréal","element_type":"university","_id":"0","_type":"vertex"}]},"element_type":{"university":[{"name":"Université de Montréal","element_type":"university","_id":"0","_type":"vertex"}]}}],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":3.732632}

All that looks good -- the encodings look right.

To create and index a vertex like the one above, Bulbs uses a custom Gremlin script via an HTTP POST request with a JSON content type.

Here's the problem...

Rexster's index lookup REST endpoint uses URL query params, and Bulbs encodes URL params as UTF-8 byte strings.
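
To illustrate what the client puts on the wire, here is a minimal sketch using only the Python 3 standard library (the question's code is Python 2, and this is not Bulbs' actual implementation). It reproduces the query string seen in the GET url of the debug output:

```python
from urllib.parse import urlencode

name = "Université de Montréal"

# urlencode() percent-encodes non-ASCII text as UTF-8 bytes by default,
# so the single character "é" becomes the two escaped bytes %C3%A9.
query = urlencode({"value": name, "key": "name"})
print(query)
```

The resulting string carries no charset information of its own, so the server has to decide how to turn the escaped bytes back into text.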

To see how Rexster handles URL query params encoded as UTF-8 byte strings, I executed a Gremlin script via a URL query param that simply returns the encoded string...

$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'Universit%C3%A9%20de%20Montr%C3%A9al'"
{"results":["UniversitÃ© de MontrÃ©al"],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":16.59432}

Egad! That's not right. As you can see, that text is mangled.

In a twist of irony, we have Gremlin returning gremlins, and that's what Rexster is using for the key's value in the index lookup, which as we can see is not what's stored in TinkerGraph's HashMap index.
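
The failed lookup can be simulated in a few lines. This is a rough Python 3 sketch, not Rexster code: the dict is a toy stand-in for TinkerGraph's HashMap index, and decoding the bytes as latin1 is an assumption about what the server does with the query param.

```python
from urllib.parse import unquote_to_bytes

# Toy stand-in for the TinkerGraph index: key -> value -> list of vertices.
index = {"name": {"Université de Montréal": [{"_id": "0", "_type": "vertex"}]}}

# The raw bytes the server receives after percent-decoding the URL param.
raw = unquote_to_bytes("Universit%C3%A9%20de%20Montr%C3%A9al")

# Assumption: the server interprets those bytes as latin1 instead of UTF-8.
mangled = raw.decode("latin-1")
correct = raw.decode("utf-8")

print(repr(mangled))
print(index["name"].get(mangled))  # the mangled key is not in the index
print(index["name"].get(correct))  # the correctly decoded key is
```

The vertex is stored correctly. The lookup misses only because the key it is asked for is a different string.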

Here's what's going on...

This is what the unquoted byte string looks like in Bulbs:

>>> name
u'Universit\xe9 de Montr\xe9al'

>>> bulbs.utils.to_bytes(name)
'Universit\xc3\xa9 de Montr\xc3\xa9al'

'\xc3\xa9' is the UTF-8 encoding of the unicode character u'\xe9' (which can also be specified as u'\u00e9').

UTF-8 uses 2 bytes to encode this character (it is a variable-width encoding that uses 1 to 4 bytes per character), and Jersey/Grizzly 1.x (Rexster's app server) has a bug where it doesn't properly handle multi-byte character encodings like UTF-8.

See http://markmail.org/message/w6ipdpkpmyghdx2p

It looks like this is fixed in Jersey/Grizzly 2.0, but switching Rexster from Jersey/Grizzly 1.x to Jersey/Grizzly 2.x is a big ordeal.

Last year TinkerPop decided to switch to Netty instead, and so for the TinkerPop 3 release this summer, Rexster is in the process of morphing into Gremlin Server, which is based on Netty rather than Grizzly.

Until then, here are a few workarounds...

Since Grizzly can't handle multi-byte encodings like UTF-8 in URL params, client libraries need to encode URL params as 1-byte latin1 encodings (AKA ISO-8859-1), which is Grizzly's default encoding.

Here's the same value encoded as a latin1 byte string...

$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'Universit%E9%20de%20Montr%E9al'"
{"results":["Université de Montréal"],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":17.765313}

As you can see, using a latin1 encoding works in this case.
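
A client could produce that latin1-encoded param as in the following sketch (Python 3 standard library only; the variable names are illustrative). It also shows why this only works "in this case": any character outside latin1 cannot be encoded this way at all.

```python
from urllib.parse import quote

name = "Université de Montréal"

# Percent-encode using latin1 bytes instead of the default UTF-8.
latin1_param = quote(name, encoding="latin-1")
print(latin1_param)

# latin1 covers only 256 code points, so other scripts fail outright.
try:
    quote("東京大学", encoding="latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```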

However, for general purposes, it's probably best for client libraries to use a custom Gremlin script via an HTTP POST request with a JSON content type and thus avoid the URL param encoding issue altogether -- this is what Bulbs is going to do, and I'll push the Bulbs update to GitHub later today.
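
As a rough sketch of why the JSON route sidesteps the problem: the request body below is built with Python 3's json module, which escapes non-ASCII characters by default. The Gremlin script and parameter names are illustrative assumptions, not the script Bulbs actually ships.

```python
import json

name = "Université de Montréal"

# Hypothetical lookup script; index_name, key and value are bound via "params",
# mirroring the shape of the POST body in the question's debug output.
payload = {
    "script": "g.idx(index_name).get(key, value)",
    "params": {"index_name": "university", "key": "name", "value": name},
}

# json.dumps() emits \u00e9 for "é" by default, so the body is pure ASCII
# and does not depend on the byte encoding the server assumes.
body = json.dumps(payload)
print(body)
```

This is the same escaping visible in the question's POST body ("Universit\u00e9 de Montr\u00e9al"), which is why creating the vertex worked while the lookup did not.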

UPDATE: It turns out that even though we cannot change Grizzly's default encoding type, we can specify UTF-8 as the charset in the HTTP request Content-Type header and Grizzly will use it. Bulbs 0.3.29 has been updated to include the UTF-8 charset in its request header, and all tests pass. The update has been pushed to both GitHub and PyPI.
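
For clients other than Bulbs, the idea can be sketched as follows with Python 3's standard library. The exact header value Bulbs sends is not shown here, so "application/json; charset=utf-8" is an assumption; the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

name = "Université de Montréal"

# Same index lookup URL as in the question, with UTF-8 percent-encoding.
url = ("http://localhost:8182/graphs/emptygraph/indices/university?"
       + urlencode({"key": "name", "value": name}))

# Assumed header value: declare the charset so the server does not fall back
# to its latin1 default.
request = Request(url, headers={"Content-Type": "application/json; charset=utf-8"})

print(request.full_url)
print(request.header_items())
```

Passing the request to urllib.request.urlopen() would send it to a running Rexster instance.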