We are currently working on integrating the Ledger Sync Service in our Cordapp: https://github.com/corda/corda-solutions/tree/master/bn-apps/ledger-sync

During our own tests, we found that under certain circumstances the ledger is not successfully synchronized or repaired after a crash.

Our test does the following:

  • Nodes A and B transact with each other, creating a state S.
  • Node B crashes and recovers to a point where it does not know S.
  • Node A creates a new transaction that consumes state S.
  • Node B uses the ledger sync service to recover all states.

In the background, the following happens: when node A creates the Tx that consumes state S, node B also receives the old Tx that created state S as a dependency. From that point on, the old Tx is recorded in the database of node B and can be retrieved by calling serviceHub.validatedTransactions.getTransaction(txId).

However, querying the vault for CONSUMED or ALL states will not return the old state S. Running the ledger sync will report that the node is out of sync, saying that the transaction that created state S is missing.
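
The observation above (the transaction is retrievable from storage, yet the vault does not list state S) can be illustrated with a small toy model. This is only a sketch: ToyNode, Tx and replayScenario are hypothetical names, none of them are Corda classes, and the assumption that a transaction received as a dependency is stored without the vault being notified is a guess at the mechanism, not something confirmed here.

```kotlin
// Toy model only: none of these types are Corda classes.
data class Tx(val id: String, val inputs: Set<String>, val outputs: Set<String>)

class ToyNode {
    // Stands in for serviceHub.validatedTransactions.
    private val storage = mutableMapOf<String, Tx>()

    // Stands in for the vault: state name -> consumed flag.
    private val vault = mutableMapOf<String, Boolean>()

    /** Stores the transaction; the vault is only updated when notifyVault is true. */
    fun record(tx: Tx, notifyVault: Boolean) {
        storage[tx.id] = tx
        if (!notifyVault) return
        // Mark known inputs as consumed and add outputs as unconsumed.
        tx.inputs.forEach { if (it in vault) vault[it] = true }
        tx.outputs.forEach { vault[it] = false }
    }

    fun getTransaction(id: String): Tx? = storage[id]

    /** Rough equivalent of a vault query with status ALL (consumed and unconsumed). */
    fun allStates(): Set<String> = vault.keys.toSet()
}

/** Replays the scenario: B has lost tx1 and later receives tx2 with tx1 as a dependency. */
fun replayScenario(): ToyNode {
    val tx1 = Tx("tx1", inputs = emptySet(), outputs = setOf("S"))
    val tx2 = Tx("tx2", inputs = setOf("S"), outputs = setOf("S2"))
    val nodeB = ToyNode()
    nodeB.record(tx1, notifyVault = false) // assumption: dependencies bypass the vault
    nodeB.record(tx2, notifyVault = true)
    return nodeB
}
```

After replayScenario(), getTransaction("tx1") returns the old transaction while allStates() contains only S2, which mirrors the mismatch between transaction storage and the vault described in the question.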

Running the repair does not fix this: consecutive runs of RequestLedgersSyncFlow keep reporting missing transactions.

I am not sure whether this use case (creating Txs while the ledger is out of sync) is actually supported. If it is not, it is hard to ensure that nodes do not transact with each other while one of them is out of sync.

I hope the issue is clear, otherwise I can also prepare and provide a test for it.

Update: Upon request, I created a fork of the Corda Solutions repo and added a test that showcases the error: https://github.com/marioschlipf/corda-solutions/commit/fe1ab5917c971fcf9732bf8af7d0f2c1800b5e37

Please add a test if possible. Could you also provide the snippet you use to query ALL states? - Adrian
Can you provide details on how you simulate the crash of a node? github.com/corda/corda-solutions/blob/… will skip over any transactions that are still available to the node (be it in a cached state or in the vault). - mritz_p
@Adrian I added a test, see my original question. - mario.schlipf
@mritz_p We are backing up the h2 database and restoring it with the node shut down. I think this is a more robust test than fiddling around with SQL. However, I managed to showcase the bug with the test setup that R3 uses. See my original question. Edit: I have seen the line of code you mentioned; removing this filter does not change anything, I have already tested this. - mario.schlipf
@mario.schlipf Shouldn't the assertion in l. 186 be assertEquals(0, ledgerSyncResult2[node1.fromNetwork().identity()]!!.missingAtRequester.size), because you ran recovery before? - mritz_p

1 Answer

I have recreated the scenario with four nodes running Ledger Sync Service built from master (most recent commit 839dfb8772c3b08447183a84e336a527a0f3975b). I have modified BogusFlow in the following way to allow for consumption of the input state:

/**
 * A trivial flow that is merely used to illustrate synchronisation by persisting meaningless transactions in
 * participant's vaults
 */
@InitiatingFlow
@StartableByRPC
class BogusFlow(
        private val them: Party,
        private val precursor: UniqueIdentifier? = null
) : FlowLogic<SignedTransaction>() {

    @Suspendable
    override fun call(): SignedTransaction {
        val notary = serviceHub.networkMapCache.notaryIdentities.first()

        val cmd = Command(BogusContract.Commands.Bogus(), listOf(them.owningKey))

        val builder = TransactionBuilder(notary)

        precursor?.let {
            val result = serviceHub.vaultService.queryBy(BogusState::class.java, LinearStateQueryCriteria(linearId = listOf(it)))
            val inputState = result.states.single()
            builder.addInputState(inputState)
        }

        builder.addOutputState(BogusState(ourIdentity, them), BOGUS_CONTRACT_ID)
                .addCommand(cmd).apply {
                    verify(serviceHub)
                }

        val partiallySigned = serviceHub.signInitialTransaction(builder)

        val session = initiateFlow(them)

        val fullySigned = subFlow(CollectSignaturesFlow(partiallySigned, setOf(session)))

        return subFlow(FinalityFlow(fullySigned))
    }
}

The CorDapp containing this flow is deployed to three nodes (Alice A, Bob B, Charlie C). A non-validating Notary (N) is used.

Consider the following steps to simulate failure and restore.

  1. Start A, B, C and N using H2 as database
  2. As A, invoke net.corda.businessnetworks.ledgersync.BogusFlow, targeting O=Bob Ltd., L=London, C=GB
  3. Shut down node A and destroy the database, i.e. rm persistence.mv.db
  4. As B, run a vaultQuery for contractStateType net.corda.businessnetworks.ledgersync.BogusState to validate that B has knowledge of the unconsumed state created in step 2. The output should contain a linearId. Take note of this ID.
  5. As B, start a flow with C, using the linearId obtained in step 4 as precursor, i.e. flow start net.corda.businessnetworks.ledgersync.BogusFlow them: "O=Charlie SARL, L=Paris, C=FR", precursor: "2429c289-0ccb-4adb-9714-32ee3d0d7f12". Note that in a production use case your contract code might prohibit this transaction from being executed in the first place, given A has not signed it.
  6. As B, run vaultQuery contractStateType: net.corda.businessnetworks.ledgersync.BogusState and validate that there is an unconsumed state with participants B and C (i.e. "participants" : [ "O=Bob Ltd., L=London, C=GB", "O=Charlie SARL, L=Paris, C=FR" ]).
  7. As A, bring the node back up, creating a new H2 database.
  8. As A, start EvaluateLedgerConsistencyFlow (i.e. connection.proxy.startFlow(::EvaluateLedgerConsistencyFlow, listOf(alice, bob, charlie))). This should return {O=Bob Ltd., L=London, C=GB=false, O=Charlie SARL, L=Paris, C=FR=true} indicating A is out of sync with B.
  9. As A, run RequestLedgersSyncFlow (i.e. connection.proxy.startFlow(::RequestLedgersSyncFlow, listOf(alice, bob, charlie))). This will return a summary of missing transactions (e.g. {O=Bob Ltd., L=London, C=GB=LedgerSyncFindings(missingAtRequester=[BAA58E9E9E2025181F00459FCE8B0D035705A38D1068A0F4C4BAB53F3F56FB40], missingAtRequestee=[]), O=Charlie SARL, L=Paris, C=FR=LedgerSyncFindings(missingAtRequester=[], missingAtRequestee=[])}).
  10. As A, run TransactionRecoveryFlow, passing in the result of step 9, e.g. connection.proxy.startFlow(::TransactionRecoveryFlow, report) where report is the result of the previous step.
  11. As A, re-run EvaluateLedgerConsistencyFlow to validate. It should return {O=Bob Ltd., L=London, C=GB=true, O=Charlie SARL, L=Paris, C=FR=true}, indicating that the discrepancy was resolved.
  12. To validate further, as A, run a vault query (i.e. VaultQueryCriteria(status = ALL), PageSpecification(), Sort(emptyList()), BogusState::class.java) to retrieve the contents and verify that the state has been recreated.
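
To make the output of steps 8, 9 and 11 easier to read, here is a minimal, Corda-free sketch of the comparison they report. It assumes that each pair of parties is compared by the set of transaction IDs each side holds; the real flows may well compare hashes or use a different representation. SyncFindings, compareLedgers and isConsistent are hypothetical names, not the library's API.

```kotlin
// Hypothetical stand-in for the LedgerSyncFindings printed in step 9; not the library class.
data class SyncFindings(
    val missingAtRequester: Set<String>,
    val missingAtRequestee: Set<String>
)

/** Compares the transaction IDs that two parties hold for their shared states. */
fun compareLedgers(requesterTxIds: Set<String>, requesteeTxIds: Set<String>) = SyncFindings(
    // Known to the counterparty but not to us.
    missingAtRequester = requesteeTxIds - requesterTxIds,
    // Known to us but not to the counterparty.
    missingAtRequestee = requesterTxIds - requesteeTxIds
)

/** A pair of ledgers counts as consistent when neither side is missing anything. */
fun isConsistent(findings: SyncFindings) =
    findings.missingAtRequester.isEmpty() && findings.missingAtRequestee.isEmpty()
```

In this model, A's empty ledger compared with B's single transaction yields one entry in missingAtRequester (the false in step 8), and once recovery has copied that transaction to A the comparison is consistent again (the true in step 11).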

Does this cover the scenario you are describing?