5
votes

First of all thank you very much for considering my question. Hope it's not too silly.

I am wondering whether there is a way to filter data on a Kinesis stream at the point of getting data records out of the stream. The official AWS docs say that the partition key

"allows the consumer that processes a particular shard to be designed with the assumption that records with the same partition key would only be sent to that consumer"

There is no way to specify (either via the REST API or via KCL) which partition key I am interested in reading data records for directly.


Data records with the same partition key will be hashed to the same shard, but how can we know which shard that is just by knowing the partition key?
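For what it's worth, Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and picking the shard whose hash key range contains it. Below is a minimal sketch of that lookup. The function names and the two-shard layout are made up for illustration; in a real stream the ranges would come from the `HashKeyRange` of each open shard returned by `ListShards` / `DescribeStream`.

```python
import hashlib
from typing import Dict, List, Optional


def partition_key_to_hash(partition_key: str) -> int:
    """Return the MD5 hash of the partition key as a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")


def find_shard(partition_key: str, shards: List[Dict]) -> Optional[str]:
    """Return the ID of the shard whose hash key range contains the key.

    `shards` is assumed to hold only open shards, shaped like the
    entries of a ListShards response (ShardId + HashKeyRange).
    """
    key_hash = partition_key_to_hash(partition_key)
    for shard in shards:
        key_range = shard["HashKeyRange"]
        start = int(key_range["StartingHashKey"])
        end = int(key_range["EndingHashKey"])
        if start <= key_hash <= end:
            return shard["ShardId"]
    return None


# Hypothetical stream with two shards splitting the key space in half.
MAX_HASH = 2**128 - 1
EXAMPLE_SHARDS = [
    {
        "ShardId": "shardId-000000000000",
        "HashKeyRange": {
            "StartingHashKey": "0",
            "EndingHashKey": str(MAX_HASH // 2),
        },
    },
    {
        "ShardId": "shardId-000000000001",
        "HashKeyRange": {
            "StartingHashKey": str(MAX_HASH // 2 + 1),
            "EndingHashKey": str(MAX_HASH),
        },
    },
]

if __name__ == "__main__":
    print(find_shard("my-partition-key", EXAMPLE_SHARDS))
```

Note that this only tells you which shard to read from. That shard will usually also carry records with other partition keys, and the mapping changes whenever the stream is resharded.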

The ultimate question is: how can I create a consumer that only receives data for a particular partition key? Or, more generally, how can I create a consumer that only receives the data it is interested in?

Thank you very much for taking the time to consider my question and for sharing your thoughts!


UPDATE 2021-02-10 :

I reached this conclusion earlier than this date but just happened to revisit this question now.

For the benefit of those who just read it or started using Kinesis:

I think sharding in general is (or was; I am not sure about its current state) not designed for implementing business logic but mainly for handling the scaling of data volume (a big data technique, in my simple understanding).

Again, I am not sure about Kinesis today, but the requirement still stands. I guess Kafka is the answer to this question; however, Kafka might still not provide the functionality you need out of the box.

1
Use a single shard? But seriously, this is not a good usage of a service like Kinesis that is able to handle thousands of events per second from millions of sources. The partition key is designed to distribute the events evenly within the dynamic number of shards in your stream. What is so unique with this partition key? What are you really trying to do? - Guy
Hi Guy, thank you very much for commenting. Can I create a consumer that only reads data records for a particular partition key, rather than reading all data records from one shard and then deciding what to do with each based on its partition key? What I want to achieve is to allow consumers to specify which data records they want to read when there are multiple producers putting records on the stream and multiple consumers that are each interested in only a particular "type" / "group" of data. - CrazyGreenHand
Not as far as I know. You are describing more of a pub-sub or a queue system. If you have different workers for different types of events, check out AWS Simple Queue Service (SQS) and create a queue for each type of event, or check AWS SNS for a pub-sub model. - Guy
Hi Guy, I totally agree with you, but I think perhaps Kinesis can be used as a low-latency pub-sub system in which more flexibility can be implemented, e.g. publishers and subscribers can exchange their roles over time. Well, thanks for considering my question anyway! - CrazyGreenHand
I think you need to create multiple streams based on the types of data you have and that your consumers are interested in. - kaptan
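To make the fallback discussed in the comments concrete: since the read side cannot be restricted to a partition key, a consumer can read everything from its shard and drop what it does not care about. A minimal sketch follows, assuming records shaped like the `Records` entries of a `GetRecords` response (each with a `PartitionKey` field); the function name and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set


def filter_by_partition_key(
    records: Iterable[Dict], wanted_keys: Set[str]
) -> Iterator[Dict]:
    """Yield only the records whose partition key is in `wanted_keys`.

    Filtering happens on the consumer side, after the records have
    already been fetched from the shard.
    """
    for record in records:
        if record["PartitionKey"] in wanted_keys:
            yield record


if __name__ == "__main__":
    # Hypothetical batch, shaped like response["Records"] from GetRecords.
    batch = [
        {"SequenceNumber": "1", "PartitionKey": "orders", "Data": b"a"},
        {"SequenceNumber": "2", "PartitionKey": "clicks", "Data": b"b"},
        {"SequenceNumber": "3", "PartitionKey": "orders", "Data": b"c"},
    ]
    for record in filter_by_partition_key(batch, {"orders"}):
        print(record["SequenceNumber"], record["Data"])
```

The consumer still pays for reading the records it discards, which is why separate streams or queues per event type, as suggested above, tend to fit this use case better.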

1 Answer

0
votes

You can use SNS or asynchronous re-invocations of your function.

Read more here where I answered a similar question: https://stackoverflow.com/a/51281888/1988232