0
votes

Hello ArangoDB community,

I have imported two collections from sqlite to ArangoDB with arangoimport (via a CSV).
Next, I try to run a simple AQL to cross reference these collections (with an end goal to connect them via edges).
Collection1 has 1,682,642 documents
Collection2 has 3,290 documents

The following AQL takes a whopping 30 seconds to complete:

FOR c1 IN Collection1
   FOR c2 IN Collection2
      FILTER c2._key == TO_STRING(c1.someField) return {"C2": c2._id, "C1": c1._id}

If I switch the conversion like so, it takes forever (I abandoned after 5 minutes):

FOR c1 IN Collection1
       FOR c2 IN Collection2
          FILTER TO_NUMBER(c2._key) == c1.someField return {"C2": c2._id, "C1": c1._id}

Adding an index on "someField" didn't help.
The same JOIN query in Sqlite (from which the data was imported) takes less than 1 second to complete

A few thoughts and questions:
1) How can I know the data types of the fields in a document?
2) _key is a string. I think "someField" is a number (because without the TO_STRING, no results returned).
3) Is adding TO_STRING on "someField" effectively makes the index on the field unusable?
4) Is there a way to make _key a number (preferably an integer). I think number comparison is faster, is it not?
5) Alternatively, can I tell arangoimport to force "someField" to be a string?
6) Is there anything else I can do to make the AQL run faster?

Any input appreciated,
Elad

1

1 Answers

1
votes
  1. The supported data types follow the JSON specs. You can determine the data types by looking at a document, e.g. using the Web UI. Use the Code view mode in the document editor to see the document as JSON:

    Document Editor in Code view mode

    "Aerosmith" is a string, 1973 is a number, the genres are string in an [ ... ] array and each song is an { ... } object. There are also null, true and false literals.

    For a programmatic way to determine the data type of an attribute there are Type check functions, e.g. TYPENAME() to return the data type name as string. Example query to count how often the attribute someField is of which data type:

    RETURN MERGE( FOR c1 IN Collection1
      COLLECT type = TYPENAME(c1.someField) WITH COUNT INTO count
      RETURN { [type]: count }
    )
    
  2. _key is always a string indeed. You can use above query if you are unsure what someField is. Please share this information.

  3. If you cast a value which is only known at run-time (here: document attribute) to a different type then yes, no index can be utilized. An index lookup is only possible if you query for a value as-is. However, you may type-cast bind variables and other constant values, as they are known at query compile time.

  4. No, the document key is always a string. There is an index on the _key attribute (the primary index), hence there is no performance penalty because it is a string instead of a numeric value.

  5. arangoimport has an option to convert numeric strings to numbers, "null" to null and "true" / "false" to Boolean values (--convert), but there is no option to force an attribute to become a string. There is a feature request to add the ability to prescribe the desired data types.

    In case you want numeric strings to stay strings, use --convert false to turn the auto-conversion off. If the values are numbers in the source file (not in quote marks), then you can adjust the file before you import it. You can also use a one-off AQL query to convert an attribute to a certain data type:

    FOR c1 IN Collection1
      UPDATE doc WITH { someField: TO_STRING(someField) } IN Collection1
    
  6. I assume that in SQLite the primary key was an integer value and therefore references to it as well (foreign keys). Because the primary key must be a string in ArangoDB, the references need to be of type string as well. Change the documents to store foreign keys as strings as well. Add a hash index to Collection1 on someField (the field you use for a join). Then this query should be fast and return the expected result:

    FOR c1 IN Collection1
      FOR c2 IN Collection2
        FILTER c2._key == c1.someField
        RETURN { C2: c2._id, C1: c1._id }