I have a CQL table with a 4-field compound key that I want to index in Solr. All 4 compound PK fields are 'text' type in CQL and 'string' type in Solr, and 2 of them can contain long strings. When I initialize the Solr core, I see many instances of the following warning message in my system.log:

pastebin

The actual message is much longer than that (200000+ characters in one line), but I truncated it for readability. A continuous stream of this warning floods my log file from the time I initialize the core until the indexing process terminates prematurely (yes, Solr is failing to index my data).

Coming from a MySQL background, I know that PKs have a max length (700 bytes in MySQL). So even though neither the Cassandra nor the Solr documentation mentions a similar limit, the first thing I did was replace the CQL compound key with a simple text key containing the sha-1 hash of the 4 fields that were previously part of the compound PK. Voilà, the warning is gone and Solr was able to index my data. So my question now is: does Solr have a limit on the length of the uniqueKey? Cassandra doesn't seem to have a problem with long compound PKs (I was able to query some of my data via CQL), but Solr seems to have a limit.

UPDATE:

After further testing, I found that it is the combination of a compound PK and CQL maps in my table schema that causes the Solr indexing issue:

  1. Compound PK + no maps (replaced by many columns) = works
  2. Simple PK (sha-1 hash of compound PK columns) + maps = works
  3. Compound PK + maps = doesn't work

I'm still not sure if the problem is related to the length of my data in any way.

CQL Table schema:

CREATE TABLE myks.mycf (
  phrase text,
  host text,
  domain text,
  path text,
  created timestamp,
  modified timestamp,

  attr1 int,
  attr2 bigint,
  attr3 double,
  attr4 int,
  attr5 bigint,
  attr6 bigint,
  attr7 double,
  attr8 double,

  scores map<text,int>,
  estimates map<text,bigint>,
  searches map<text,bigint>,

  PRIMARY KEY (phrase,domain,host,path),
) WITH gc_grace_seconds = 1296000
AND compaction={'class': 'LeveledCompactionStrategy'}
AND compression={'sstable_compression': 'LZ4Compressor'}

Solr schema:

<schema name="myks" version="1.5">
  <types>
    <fieldType name="text" class="solr.TextField">
     <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
     </analyzer>
    </fieldType>
    <fieldType name="string" class="solr.StrField" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" omitNorms="true"/>
    <fieldtype name="binary" class="solr.BinaryField"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
  </types>
  <fields>
    <field name="phrase" type="string" indexed="true" stored="true"/>
    <field name="host" type="string" indexed="true" stored="true"/>
    <field name="domain" type="string" indexed="true" stored="true"/>
    <field name="path" type="string" indexed="true" stored="true"/>
    <field name="created" type="date" indexed="true" stored="true"/>
    <field name="modified" type="date" indexed="true" stored="true"/>

    <field name="attr1" type="int" indexed="true" stored="true"/>
    <field name="attr2" type="long" indexed="true" stored="true"/>
    <field name="attr3" type="double" indexed="true" stored="true"/>
    <field name="attr4" type="int" indexed="true" stored="true"/>
    <field name="attr5" type="long" indexed="true" stored="true"/> 
    <field name="attr6" type="long" indexed="true" stored="true"/>
    <field name="attr7" type="double" indexed="true" stored="true"/>
    <field name="attr8" type="double" indexed="true" stored="true"/>

    <!-- CQL collection maps -->
    <dynamicField name="scores*" type="int" indexed="true" stored="true"/>
    <dynamicField name="estimates*" type="long" indexed="true" stored="true"/>
    <dynamicField name="searches*" type="long" indexed="true" stored="true"/>

    <!-- docValues - facet -->
    <field name="dv__domain" type="string" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr4" type="int" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr8" type="double" indexed="true" stored="false" docValues="true" multiValued="true"/>

    <!-- docValues - group -->
    <field name="dv__phrase" type="string" indexed="true" stored="false" docValues="true" multiValued="true"/>

    <!-- docValues - sort -->
    <field name="dv__attr2" type="long" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr5" type="long" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr1" type="int" indexed="true" stored="false" docValues="true" multiValued="true"/>
  </fields>

  <!-- Why we use copyFields for docValues: http://stackoverflow.com/questions/26495208/solr-docvalues-usage -->
  <copyField source="domain" dest="dv__domain"/>
  <copyField source="attr4" dest="dv__attr4"/>
  <copyField source="attr8" dest="dv__attr8"/>
  <copyField source="phrase" dest="dv__phrase"/>
  <copyField source="attr2" dest="dv__attr2"/>
  <copyField source="attr5" dest="dv__attr5"/>
  <copyField source="attr1" dest="dv__attr1"/>

  <defaultSearchField>phrase</defaultSearchField>
  <uniqueKey>(phrase,domain,host,path)</uniqueKey>
</schema>

I use CQLSSTableWriter to generate sstables out of CSVs dumped from MySQL. For the CQL maps, I chose Java HashMap to represent the values.

I also found out today that even Cassandra seems to have an issue with the combination of compound PK and maps. When I looked at the filesystem, the copy of the table that uses compound PK + maps has a much smaller folder size than the copies that use either simple PK + maps or compound PK + no maps.

Can you share the error you see when you create your core? - phact
There is no error when I create the core. The problem happens during indexing. I tried two variations: 1. Create the table, import all data to Cassandra, then initialize the Solr core; 2. Create the table, initialize the Solr core, then import all data. Both lead to the same scenario: the indexing process on each node reaches at most 2% (usually 0% or 1%) and exits without any error message aside from the many warning lines that flood the system.log. - PJ.
Hmm okay, I was asking because the tombstone errors are really Cassandra related. Not sure how they might affect indexing. Are you doing lots of deletes? - phact
No. This table is not in production use yet (still in the bulk-loading stage). Also, see my update in the OP. It might give a clue. - PJ.
Can you share your tables? I will try to reproduce. - phact

1 Answer

Cassandra does have a limit of 64K for keys.
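
To make the 64K figure concrete, here is a minimal Java sketch that checks whether a key value would fit before it is written. It assumes "64K" means 65,535 bytes of UTF-8 (the largest length that fits in an unsigned 16-bit field); the class and method names are hypothetical and not part of any Cassandra API.

```java
import java.nio.charset.StandardCharsets;

final class KeySizeCheck {
    // Assumption: the 64K key limit is 65,535 bytes.
    static final int MAX_KEY_BYTES = 65535;

    // Returns true if the value's UTF-8 encoding fits within the assumed limit.
    // The limit is on encoded bytes, not characters, so non-ASCII text
    // reaches it sooner than its character count suggests.
    static boolean fitsKeyLimit(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_KEY_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsKeyLimit("www.example.com"));
    }
}
```

Running a check like this over the `phrase`, `domain`, `host`, and `path` columns of the CSV dump would show whether any single value comes close to the limit.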

Generally in Solr, "text" should not be used for the key since it is tokenized. Use a "string" field instead.

As the Cassandra FAQ wiki notes, a hash is a better choice for using long text values for keys: http://wiki.apache.org/cassandra/FAQ#max_key_size
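
A hashed key along those lines could be built as in the following sketch. The question does not say how the four columns were combined before hashing, so the length-prefixed concatenation here is an assumption, and `KeyHasher` and `hashKey` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class KeyHasher {
    // Derives one fixed-length (40 hex chars) key from the four original key columns.
    // Inputs must be non-null, as primary key columns always are.
    static String hashKey(String phrase, String domain, String host, String path) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            for (String part : new String[] {phrase, domain, host, path}) {
                byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
                // Length prefix keeps ("ab", "c") distinct from ("a", "bc").
                sha1.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                sha1.update(bytes);
            }
            StringBuilder hex = new StringBuilder(40);
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hashKey("buy shoes", "example.com", "www.example.com", "/shop"));
    }
}
```

The original four columns can still be stored and indexed as regular fields, so queries on `phrase` or `domain` keep working while the key itself stays short.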

Ultimately, it comes down to how you wish to query the Solr documents.

The general guidance for "limits" in Solr is simply to "be reasonable" - big anything is very likely to cause you problems down the road somewhere.