0 votes

I have a use case where I have to fetch 10k records each from two different databases, do some data enrichment, and push these 20k records in batches to a third database.

Approach I followed:

  1. Fetched the data from both databases using a Scatter-Gather, so that both payloads are in the same Mule message and can be accessed as payload[0].payload and payload[1].payload
  2. Used a DataWeave transformer to join the records
  3. Used a For Each loop to insert the data into the third database in batches of 2k (a sketch of the whole flow follows this list)
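
For reference, here is a rough sketch of the flow described above. The connector config names (dbOneConfig, dbTwoConfig, targetDbConfig), table names, and SQL are placeholders, not part of the original question:

<flow name="enrich-and-load-flow">
    <scatter-gather doc:name="Fetch from both databases">
        <route>
            <db:select config-ref="dbOneConfig">
                <db:sql>SELECT StudentID, Name FROM students</db:sql>
            </db:select>
        </route>
        <route>
            <db:select config-ref="dbTwoConfig">
                <db:sql>SELECT RollNumber, Course FROM enrollments</db:sql>
            </db:select>
        </route>
    </scatter-gather>
    <!-- Keep each result set in a variable so the join can reference them -->
    <set-variable variableName="databaseOneRecords" value="#[payload[0].payload]"/>
    <set-variable variableName="databaseTwoRecords" value="#[payload[1].payload]"/>
    <!-- The "Outer Join And Merge" transform shown below goes here -->
    <!-- Insert the merged records in chunks of 2000; For Each splits the
         collection into sub-lists when batchSize is set -->
    <foreach batchSize="2000">
        <db:bulk-insert config-ref="targetDbConfig">
            <db:sql>INSERT INTO merged_students (student_id, roll_number) VALUES (:StudentID, :RollNumber)</db:sql>
        </db:bulk-insert>
    </foreach>
</flow>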

But when doing this I often get a fatal Mule JVM error in my Transform Message component:

Message               : java.lang.StackOverflowError.
Error type            : MULE:FATAL_JVM_ERROR

Are there any blogs or design patterns in Mule that better address this issue?

DataWeave code:

<ee:transform doc:name="Outer Join And Merge" doc:id="fd801b56-9992-4a89-95a3-62ab4c4dc5a2">
<ee:message>
<ee:set-payload>%dw 2.0
import * from dw::core::Arrays
output application/java
// Outer-join the two result sets on StudentID / RollNumber
var joinedData = outerJoin(vars.databaseOneRecords, vars.databaseTwoRecords, (obj) -> obj.StudentID, (obj) -> obj.RollNumber)
---
// Split the joined rows into records that matched on both sides
// and records that exist in only one of the two databases
joinedData reduce ((item, acc = { 'matched': [], 'unmatched': [] }) ->
    if (item.l != null and item.r != null)
        {
            matched: (acc.matched default []) ++ [item.l ++ item.r],
            unmatched: acc.unmatched
        }
    else
        {
            matched: acc.matched,
            unmatched: (acc.unmatched default []) ++ [if (item.l != null) item.l else item.r]
        }
)</ee:set-payload>
</ee:message>
</ee:transform>




1 Answer

1 vote

I usually see 2 patterns in the "ETL" world:

  1. In-memory: get a machine with a huge amount of RAM and process everything in memory, like the approach you are taking.
  2. Classic ETL: use a staging area, such as an intermediate database, where you drop all the information and then run a query to extract everything you need in one shot. This approach uses much less RAM (a sketch of this pattern follows the list).
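
A hypothetical sketch of the second pattern in Mule terms; the staging config name (stagingDbConfig), table names, and SQL are assumptions, and FULL OUTER JOIN syntax varies by database:

<flow name="staging-etl-flow">
    <!-- 1. Land each extract in its own staging table instead of joining in memory -->
    <db:bulk-insert config-ref="stagingDbConfig" doc:name="Stage DB1 records">
        <db:bulk-input-parameters>#[vars.databaseOneRecords]</db:bulk-input-parameters>
        <db:sql>INSERT INTO stg_students (student_id, name) VALUES (:StudentID, :Name)</db:sql>
    </db:bulk-insert>
    <db:bulk-insert config-ref="stagingDbConfig" doc:name="Stage DB2 records">
        <db:bulk-input-parameters>#[vars.databaseTwoRecords]</db:bulk-input-parameters>
        <db:sql>INSERT INTO stg_enrollments (roll_number, course) VALUES (:RollNumber, :Course)</db:sql>
    </db:bulk-insert>
    <!-- 2. Let the staging database do the outer join in one shot -->
    <db:select config-ref="stagingDbConfig" doc:name="Read merged records">
        <db:sql>
            SELECT s.student_id, s.name, e.roll_number, e.course
            FROM stg_students s
            FULL OUTER JOIN stg_enrollments e ON s.student_id = e.roll_number
        </db:sql>
    </db:select>
    <!-- 3. Load the result into the target database in batches of 2000 -->
    <foreach batchSize="2000">
        <db:bulk-insert config-ref="targetDbConfig">
            <db:sql>INSERT INTO merged_students (student_id, roll_number) VALUES (:student_id, :roll_number)</db:sql>
        </db:bulk-insert>
    </foreach>
</flow>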

Speaking specifically about Mule, whether the streams stay streamed or go fully into memory depends on how you consume them. For example, if you do something like a groupBy, or a search that forces the stream to be read back, the full stream most likely ends up in memory.
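
For example, a minimal sketch of an operation that forces full materialization, assuming the payload is a streamed database result:

%dw 2.0
output application/java
---
// groupBy (and likewise reduce) cannot produce any output until it has
// consumed every record, so a streamed payload gets pulled fully into memory.
// A plain map, by contrast, can visit records one at a time.
payload groupBy (record) -> record.StudentID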