1 vote

In Pentaho, when I run a Cassandra Input step that returns around 50,000 rows, I get the exception below.

Is there a way to control the query result size in Pentaho? Or is there a way to stream the query result instead of fetching it all in bulk?

2014/10/09 15:14:09 - Cassandra Input.0 - ERROR (version 5.1.0.0, build 1 from 2014-06-19_19-02-57 by buildguy) : Unexpected error
2014/10/09 15:14:09 - Cassandra Input.0 - ERROR (version 5.1.0.0, build 1 from 2014-06-19_19-02-57 by buildguy) : org.pentaho.di.core.exception.KettleException: 
2014/10/09 15:14:09 - Cassandra Input.0 - Frame size (17727647) larger than max length (16384000)!
2014/10/09 15:14:09 - Cassandra Input.0 - Frame size (17727647) larger than max length (16384000)!
2014/10/09 15:14:09 - Cassandra Input.0 - 
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.pentaho.di.trans.steps.cassandrainput.CassandraInput.initQuery(CassandraInput.java:355)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.pentaho.di.trans.steps.cassandrainput.CassandraInput.processRow(CassandraInput.java:234)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2014/10/09 15:14:09 - Cassandra Input.0 -   at java.lang.Thread.run(Unknown Source)
2014/10/09 15:14:09 - Cassandra Input.0 - Caused by: org.apache.thrift.transport.TTransportException: Frame size (17727647) larger than max length (16384000)!
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:137)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql_query(Cassandra.java:1656)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.apache.cassandra.thrift.Cassandra$Client.execute_cql_query(Cassandra.java:1642)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.pentaho.cassandra.legacy.LegacyCQLRowHandler.newRowQuery(LegacyCQLRowHandler.java:289)
2014/10/09 15:14:09 - Cassandra Input.0 -   at org.pentaho.di.trans.steps.cassandrainput.CassandraInput.initQuery(CassandraInput.java:333)
2014/10/09 15:14:09 - Cassandra Input.0 -   ... 3 more
2014/10/09 15:14:09 - Cassandra Input.0 - Finished processing (I=0, O=0, R=0, W=0, U=0, E=1)
2014/10/09 15:14:09 - all customer data - Transformation detected one or more steps with errors.
2014/10/09 15:14:09 - all customer data - Transformation is killing the other steps!
I am also using Cassandra but have never faced such an error. Try increasing read_request_timeout_in_ms in cassandra.yaml and the -Xmx1024m heap setting in pentaho.bat or pentaho.sh, depending on your OS, and check whether you still hit the error. - Helping Hand..
How big are your queries? Are you issuing queries that return around 60,000 or more rows with 5 columns? - Moataz Soliman
I am returning more than 200,000 rows with 7 columns. - Helping Hand..
Which version of Pentaho are you using? - Moataz Soliman
Are you using the free edition or the paid edition? - Moataz Soliman

4 Answers

2 votes
org.apache.thrift.transport.TTransportException: 
  Frame size (17727647) larger than max length (16384000)!

A limit is enforced on how large frames (Thrift messages) can be, to avoid performance degradation. You can tweak this by modifying a few settings. The important thing to note is that you need to change the settings on both the client side and the server side.

Server side, in cassandra.yaml:

# Frame size for thrift (maximum field length).
# default is 15mb, you'll have to increase this to at least 18.
thrift_framed_transport_size_in_mb: 18 

# The max length of a thrift message, including all fields and
# internal thrift overhead.
# default is 16, try to keep it to thrift_framed_transport_size_in_mb + 1
thrift_max_message_length_in_mb: 19

Setting the client-side limit depends on which driver you're using.
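For example, with the legacy Thrift client that the stack trace above is going through, the second constructor argument of TFramedTransport sets the maximum frame length the client will accept. A minimal sketch, assuming a node on localhost:9160 and the 19 MB limit from the yaml above:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

// Match thrift_max_message_length_in_mb on the server (19 MB here)
int maxFrameBytes = 19 * 1024 * 1024;
TSocket socket = new TSocket("localhost", 9160);
TFramedTransport transport = new TFramedTransport(socket, maxFrameBytes);
Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
transport.open();

In PDI itself you don't construct this client by hand; the point is that whatever driver sits between your tool and Cassandra has its own frame-size limit that must be raised along with the server's.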

0 votes

I resolved this problem by using PDI 5.2, where the Cassandra Input step has a property called max_length. Setting this property to a higher value, such as 1GB, solved the problem.

0 votes

You can try the following method on the server side:

import org.apache.thrift.TProcessor;
import org.apache.thrift.protocol.TCompactProtocol;
import org.apache.thrift.server.TNonblockingServer;
import org.apache.thrift.server.TServer;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TNonblockingServerSocket;

TNonblockingServerSocket tnbSocketTransport = new TNonblockingServerSocket(listenPort);
TNonblockingServer.Args tnbArgs = new TNonblockingServer.Args(tnbSocketTransport);

// maxLength is configured to 1GB, while the default size is 16MB
tnbArgs.transportFactory(new TFramedTransport.Factory(1024 * 1024 * 1024));
tnbArgs.protocolFactory(new TCompactProtocol.Factory());

// UcsInterfaceThrift/UcsInterfaceHandler are this application's generated
// Thrift service and its handler implementation; substitute your own.
TProcessor processor = new UcsInterfaceThrift.Processor<UcsInterfaceHandler>(ucsInterfaceHandler);
tnbArgs.processor(processor);
TServer server = new TNonblockingServer(tnbArgs);
server.serve();
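Note that the int passed to TFramedTransport.Factory is the maximum frame length the server will accept; the client connecting to this server needs its own TFramedTransport limit raised to match, otherwise the same TTransportException just reappears on the client side.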
0 votes

Well, this is what worked for me.

Cassandra Version: [cqlsh 5.0.1 | Cassandra 2.2.1 | CQL spec 3.3.0 | Native protocol v4]

Pentaho PDI Version: pdi-ce-5.4.0.1-130

Changed Settings in cassandra.yaml:

# Whether to start the thrift rpc server.
start_rpc: true

# Frame size for thrift (maximum message length).
thrift_framed_transport_size_in_mb: 35

Cassandra Output Step Settings Changed to:

Port: 9160
"Use CQL Version 3": checked