
OK, so I have been beating my head against Apache Beam for a few weeks now. I am learning, but I keep getting stuck on things that seem trivial. I have about 60 million rows of data in two separate CSV files. The rows consist of ints and floats. I'll ask my question first, then explain how it fits into a bigger process, since I think the context helps.

Each row in my PCollection looks like this after being read in: '11139422, 11139421, 11139487, 11139449, 11139477, 27500, 60.75, 60.75, 60.75'

I first convert it to look like this: '11139422', '11139421', '11139487', '11139449', '11139477', '27500', '60.75', '60.75', '60.75'

I then want to turn each of the values into a tuple pair so that I can add values later. For example, I would like each row in the PCollection to look like this: (p1, 11139422), (p2, 11139421), (p3, 11139487), (p4, 11139449), (p5, 11139477), (sal, 27500), (fp, 60.75), (bp, 60.75), (pp, 60.75)
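The labelling step described above can be sketched in plain Python as a function that could be handed to `beam.Map`. This is only a minimal sketch: the name `to_tagged_pairs` is hypothetical, and it assumes (based on the sample row) that the first six fields are ints and the last three are floats.

```python
# Field labels taken from the example in the question.
FIELD_NAMES = ("p1", "p2", "p3", "p4", "p5", "sal", "fp", "bp", "pp")


def to_tagged_pairs(line):
    """Turn one CSV line into a list of (label, value) tuples.

    Assumption: the first six fields are ints and the last three are
    floats, as in the sample row from the question.
    """
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELD_NAMES), len(values))
        )
    parsed = [int(v) for v in values[:6]] + [float(v) for v in values[6:]]
    return list(zip(FIELD_NAMES, parsed))
```

In a pipeline this would be applied per element, for example with `beam.Map(to_tagged_pairs)` after the read step. Wrapping the result in `dict(...)` gives a dictionary per row instead of a list of tuples.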

If I am understanding how to let the parallel processing execute efficiently, I THINK I should turn each row into a dictionary with some type of hashed key: some_hashed_key: (collection of tagged tuple values from above). I have not done any work on my next step yet because I am currently stuck here. My next step is basically to perform a Cartesian product between two PCollections. Both will be formatted almost exactly the same as above. My plan is to broadcast each dictionary key from the left PCollection to every dictionary key in the right PCollection, add some values together between the PCollections, and then Flatten it all into one PCollection and send it to a Pub/Sub queue. Again, I'm just providing context, not asking anybody to write that code for me, thanks!
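For readers following along, the pairing logic in that plan looks roughly like the plain-Python sketch below. It is not Beam code and not the asker's implementation: `cross_combine` is a hypothetical name, and which fields get added together (`sal`, `fp`, `bp`, `pp` here) is an assumption, since the question does not say.

```python
from itertools import product


def cross_combine(left_rows, right_rows, fields=("sal", "fp", "bp", "pp")):
    """Pair every left row with every right row and sum selected fields.

    Assumption: rows are dicts keyed by field label, and the fields to
    add together are the ones listed in `fields` (chosen for illustration).
    """
    for left, right in product(left_rows, right_rows):
        yield {name: left[name] + right[name] for name in fields}
```

In Beam itself, one common way to express "every left element against every right element" is to pass the right PCollection as a side input (for example via `beam.pvalue.AsList`) to a `beam.FlatMap` over the left one. Whether that is practical depends on how large the right-hand side is, because the number of output pairs is the product of the two input sizes.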


1 Answer


Got sooooooo lucky! I found the answer here:

How to convert csv into a dictionary in apache beam dataflow

So, no transforms required: it's a built-in function that adds the key:value pairs automatically on read-in. Hope somebody stumbles on this post and it makes their day as good as mine!
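The linked answer isn't reproduced here, so the following is only a guess at the kind of built-in being referred to. Python's standard-library `csv.DictReader` produces key:value pairs as it reads. The field labels come from the question; the function name `rows_as_dicts` is hypothetical.

```python
import csv

# Field labels taken from the example in the question.
FIELD_NAMES = ["p1", "p2", "p3", "p4", "p5", "sal", "fp", "bp", "pp"]


def rows_as_dicts(text_stream):
    """Yield one dict per CSV row, keyed by FIELD_NAMES.

    skipinitialspace=True drops the space that follows each comma in
    the sample data. Values come back as strings; convert as needed.
    """
    reader = csv.DictReader(
        text_stream, fieldnames=FIELD_NAMES, skipinitialspace=True
    )
    for row in reader:
        yield dict(row)
```

Passing `fieldnames` explicitly suits files without a header row. If the file does have a header, omitting `fieldnames` makes `DictReader` take the keys from the first line instead.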