I have a CQL table with a 4-field compound primary key that I want to index in Solr. All 4 key fields are 'text' type in CQL and 'string' type in Solr, and 2 of them can potentially contain long strings. When I initialize the Solr core, I see many instances of the following warning message in my system.log:
The actual message is much longer than that (200000+ characters on one line), but I truncated it for readability. A continuous stream of these warnings floods my log file from the moment I initialize the core until the indexing process terminates prematurely (yes, Solr is failing to index my data).
Coming from a MySQL background, I know that PKs have a maximum length (about 700 bytes in MySQL). So even though neither the Cassandra nor the Solr documentation mentions a similar limit, the first thing I did was replace the CQL compound key with a simple text key containing the SHA-1 hash of the 4 fields that were previously part of the compound PK. Voilà: the warning was gone and Solr was able to index my data. So my question now is: does Solr have a limit on the length of the uniqueKey? Cassandra doesn't seem to have a problem with long compound PKs (I was able to query some of my data via CQL), but Solr seems to have a limit.
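A minimal sketch of that kind of key derivation in Java, assuming the four fields are UTF-8 encoded and joined with a zero-byte separator before hashing. The class name `KeyHasher`, the method `sha1Key`, and the separator are illustrative assumptions, not necessarily the exact code used for the table above:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class KeyHasher {
    // Hypothetical helper: derives a fixed-length key (40 hex characters)
    // from the four columns that used to form the compound PK.
    public static String sha1Key(String phrase, String domain, String host, String path) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (String part : new String[] {phrase, domain, host, path}) {
                md.update(part.getBytes(StandardCharsets.UTF_8));
                // Assumed separator, so that ("ab", "c") and ("a", "bc") hash differently.
                md.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder(40);
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Whatever the length of the four input strings, the resulting key is always 40 characters, which is why it sidesteps any length-related limit on the key.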
UPDATE:
After further testing, I found that it is somehow the combination of a compound PK and CQL maps in my table schema that causes the Solr indexing issue:
- Compound PK + no maps (replaced by many columns) = works
- Simple PK (sha-1 hash of compound PK columns) + maps = works
- Compound PK + maps = doesn't work
I'm still not sure whether the problem is related to the length of my data in any way.
CQL Table schema:
CREATE TABLE myks.mycf (
    phrase text,
    host text,
    domain text,
    path text,
    created timestamp,
    modified timestamp,
    attr1 int,
    attr2 bigint,
    attr3 double,
    attr4 int,
    attr5 bigint,
    attr6 bigint,
    attr7 double,
    attr8 double,
    scores map<text,int>,
    estimates map<text,bigint>,
    searches map<text,bigint>,
    PRIMARY KEY (phrase,domain,host,path),
) WITH gc_grace_seconds = 1296000
    AND compaction={'class': 'LeveledCompactionStrategy'}
    AND compression={'sstable_compression': 'LZ4Compressor'}
Solr schema:
<schema name="myks" version="1.5">
  <types>
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="string" class="solr.StrField" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" omitNorms="true"/>
    <fieldtype name="binary" class="solr.BinaryField"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
  </types>
  <fields>
    <field name="phrase" type="string" indexed="true" stored="true"/>
    <field name="host" type="string" indexed="true" stored="true"/>
    <field name="domain" type="string" indexed="true" stored="true"/>
    <field name="path" type="string" indexed="true" stored="true"/>
    <field name="created" type="date" indexed="true" stored="true"/>
    <field name="modified" type="date" indexed="true" stored="true"/>
    <field name="attr1" type="int" indexed="true" stored="true"/>
    <field name="attr2" type="long" indexed="true" stored="true"/>
    <field name="attr3" type="double" indexed="true" stored="true"/>
    <field name="attr4" type="int" indexed="true" stored="true"/>
    <field name="attr5" type="long" indexed="true" stored="true"/>
    <field name="attr6" type="long" indexed="true" stored="true"/>
    <field name="attr7" type="double" indexed="true" stored="true"/>
    <field name="attr8" type="double" indexed="true" stored="true"/>
    <!-- CQL collection maps -->
    <dynamicField name="scores*" type="int" indexed="true" stored="true"/>
    <dynamicField name="estimates*" type="long" indexed="true" stored="true"/>
    <dynamicField name="searches*" type="long" indexed="true" stored="true"/>
    <!-- docValues - facet -->
    <field name="dv__domain" type="string" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr4" type="int" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr8" type="double" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <!-- docValues - group -->
    <field name="dv__phrase" type="string" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <!-- docValues - sort -->
    <field name="dv__attr2" type="long" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr5" type="long" indexed="true" stored="false" docValues="true" multiValued="true"/>
    <field name="dv__attr1" type="int" indexed="true" stored="false" docValues="true" multiValued="true"/>
  </fields>
  <!-- Why we use copyFields for docValues: http://stackoverflow.com/questions/26495208/solr-docvalues-usage -->
  <copyField source="domain" dest="dv__domain"/>
  <copyField source="attr4" dest="dv__attr4"/>
  <copyField source="attr8" dest="dv__attr8"/>
  <copyField source="phrase" dest="dv__phrase"/>
  <copyField source="attr2" dest="dv__attr2"/>
  <copyField source="attr5" dest="dv__attr5"/>
  <copyField source="attr1" dest="dv__attr1"/>
  <defaultSearchField>phrase</defaultSearchField>
  <uniqueKey>(phrase,domain,host,path)</uniqueKey>
</schema>
I use CQLSSTableWriter to generate sstables out of CSVs dumped from MySQL. For the CQL maps, I chose Java HashMap to represent the values.
I also found out today that even Cassandra seems to have an issue with the combination of a compound PK and maps. When I looked at the filesystem, the copy of the table that uses compound PK + maps had a much smaller folder size than the copies that use either simple PK + maps or compound PK + no maps.