I need to process about 4,000 Cassandra queries. I convert each query's ResultSet into a generator to keep the memory footprint low. Within each row of the generator, I'm only concerned with a few of the roughly 50 fields present.

I know that I can't filter directly on value fields in CQL, but does the DataStax Python Cassandra driver have something built in that does this? Or would it make more sense to just do this when I build the generator, i.e.

def make_gen(response):
    for row in response:
        yield row.value.field1, row.value.field2

I am issuing direct queries at the moment but will move to a model-based approach later, with concurrent queries and prepared statements. The code that issues the request is very basic:

sess = connect_cas(env)
for user in users:
    q = 'select * from table ' + \
        'where key1 = {} and '.format(key_1) + \
        'key2 = {} and '.format(key_2) + \
        'sample_time > {} and '.format(t1) + \
        'sample_time < {}'.format(t2)
    resp_gen = make_gen(sess.execute(q))  # just a yield json.loads(Row.value)
    for resp in resp_gen:
        if field in resp:
            ...  # process data from this field

I only care about rows where this "field" is present. I've since updated my generator to yield data only when this condition is true. However, if there is something built into the DataStax driver that does this more efficiently, the savings will add up over 4,000 queries.
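For reference, the updated generator looks roughly like the sketch below. It assumes each row exposes a `value` attribute holding a JSON string (as in the inline comment above); the `field` parameter and the choice to yield only that field's data are illustrative, not the exact code.

```python
import json


def make_gen(response, field):
    """Yield the data for `field` from each row whose JSON value contains it.

    Assumes every row has a `value` attribute holding a JSON object string.
    """
    for row in response:
        doc = json.loads(row.value)
        if field in doc:
            # Rows without the field are skipped, so the caller never sees them.
            yield doc[field]
```

Because it is still a generator, rows are decoded one at a time as the caller iterates.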

Please show the code that issues the request. Are you using the model-based approach or direct queries? - Alex Ott

1 Answer

Are you saying that you only process rows where field1 or field2 is set to a particular value?

It's not exactly built for this purpose, but you could use a custom row_factory to do this filtering at a lower level and avoid the conversions between named tuple and tuple, as well as the additional generator.
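A minimal sketch of that idea follows. In the driver, a row factory is a callable that receives `(colnames, rows)`, where each row is a plain tuple, and is set via `session.row_factory`. The helper name `make_filtering_factory`, the `value` column name, and the assumption that the column holds a JSON string are all hypothetical and taken from the question, so adjust them to your schema.

```python
import json


def make_filtering_factory(required_field, value_column="value"):
    """Build a row factory that keeps only rows containing `required_field`.

    Hypothetical helper: assumes `value_column` holds a JSON object string.
    """
    def factory(colnames, rows):
        # Find the column position once per result page.
        idx = list(colnames).index(value_column)
        kept = []
        for row in rows:
            doc = json.loads(row[idx])
            if required_field in doc:
                kept.append(doc[required_field])
        return kept

    return factory


# Usage with a driver session (not executed here):
# sess.row_factory = make_filtering_factory("field1")
# for data in sess.execute(q):
#     ...  # only rows that contain "field1" arrive here
```

The factory runs on each page of results as it is fetched, so the full result set is still transferred from the cluster; the saving is only in client-side object construction.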