
Suppose I have an MD5 sum that was calculated from an array of N 64-byte elements, and I want to replace the element at an arbitrary index in the source array with a new element. Instead of recalculating the MD5 sum by re-running the whole array through an MD5 function, I would like to "subtract" the old element from the result and "add" the new element to it.

To be a bit more clear, here's some pseudo-Scala:

class Block {
  var summary: MD5Result

  // The key reason behind this question is that the elements might not be
  // loaded. With a large array, it can get expensive to load everything just to
  // update a single thing.
  var data: Array[Option[Element]]

  def replaceElement(index: Int, newElement: Element): Unit = {
    // we can know the element that we're replacing
    val oldElement = data(index) match {
      case Some(x) => x
      case None    => loadData(index) // <- this is expensive
    }
    data(index) = Some(newElement)

    // update the MD5 using this magic function
    summary = replaceMD5(summary, index, oldElement, newElement)
  }
}

Is replaceMD5 implementable? While all signs point to "this is breaking a (weak) cryptographic hash," the actual MD5 algorithm seems to support doing this (but I might be missing something obvious).

1
TTBOMK MD5 computation processes bytes strictly in increasing order. If so, you could record the sequence of intermediate (state) values of the MD5 computation after each 64-byte unit: then if data[i] is changed, you could restart the MD5 computation from that point, i.e. recalculate just the remaining (n-i+1)*64 bytes. If changes are uniformly distributed at random, this will save half the computation on average. TTBOMK any change will alter all "downstream" states in an unpredictable way, so I doubt anything can be done to mitigate near-the-start changes. – j_random_hacker
I believe that this is not possible without spending more time than just rerunning MD5. Can you explain why you believe that the actual MD5 algorithm seems to support doing this? – Salvador Dali
The issue isn't the computation time of re-running the algorithm -- it's that I have to perform an expensive operation (I/O) to figure out what data to even feed the algorithm. – Travis Gockel
@SalvadorDali: Obviously MD5 was not designed with this sort of crazy operation in mind. I don't have an explicit reason to believe this is possible, but it seems like it should be, even if it's computationally expensive. If it's not possible, then so be it. – Travis Gockel
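The checkpoint idea from j_random_hacker's comment can be sketched in Python: `hashlib` hash objects support `copy()`, which snapshots the internal state after any prefix of the input. The function names here (`checkpointed_md5`, `update_from`) are illustrative, not part of any library.

```python
import hashlib

BLOCK = 64  # MD5 processes its input in 64-byte blocks


def checkpointed_md5(blocks):
    """Hash a list of 64-byte blocks, snapshotting the hasher after each one."""
    states = []
    h = hashlib.md5()
    for b in blocks:
        h.update(b)
        states.append(h.copy())  # checkpoint of the internal state
    return h.hexdigest(), states


def update_from(states, blocks, i, new_block):
    """Replace blocks[i], then resume hashing from the last unaffected checkpoint.

    Only blocks i..n-1 are rehashed. Note that the checkpoints from index i
    onward are now stale; rebuild them if you need further updates.
    """
    blocks[i] = new_block
    h = states[i - 1].copy() if i > 0 else hashlib.md5()
    for b in blocks[i:]:
        h.update(b)
    return h.hexdigest()
```

As the comment notes, this only saves the *tail* of the computation (half the work on average for uniformly random indices); a change near the start still forces rehashing almost everything.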

1 Answer


I think I better understand what you want to do now. My solution below assumes nothing about MD5 computation, but involves a tradeoff between IO and storing a large number of MD5 hashes. Instead of computing the simple MD5 hash of the entire dataset, it computes a different MD5 hash that nevertheless should have the same important property: that any change to any element (drastically) changes it.

  1. At the outset, decide on a block size b such that
    • you can afford to read b values from disk (or whatever IO you're talking about) per change of element, and
    • you can afford to keep 2n/b MD5 hashes in memory.
  2. Create a binary tree of MD5 hashes. Each leaf in this tree will be the MD5 hash of a size-b block. Each internal node is the MD5 hash of its two children. We will use the hash of the root of this tree as "the" MD5 hash.
  3. When element i changes, read in the b elements in block RoundDown(i/b), compute the new MD5 hash for this block, and then propagate the change up the tree (at most log2(n/b) rehashes, one per level).
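Steps 2 and 3 can be sketched in Python (this structure is usually called a Merkle tree). The `HashTree` name is illustrative, and the convention of promoting an odd node unchanged to the next level is an assumption, since the answer doesn't specify how to pair an odd number of nodes.

```python
import hashlib


def md5(data: bytes) -> bytes:
    return hashlib.md5(data).digest()


class HashTree:
    """Binary tree of MD5 hashes over fixed-size blocks (a Merkle tree)."""

    def __init__(self, blocks):
        # Leaves: one MD5 hash per size-b block.
        level = [md5(b) for b in blocks]
        self.levels = [level]
        while len(level) > 1:
            # Each internal node hashes the concatenation of its two children;
            # an odd node out is promoted to the next level unchanged.
            level = [md5(level[j] + level[j + 1]) if j + 1 < len(level) else level[j]
                     for j in range(0, len(level), 2)]
            self.levels.append(level)

    @property
    def root(self) -> bytes:
        """The root hash, used as 'the' hash of the whole dataset."""
        return self.levels[-1][0]

    def replace_block(self, i, new_block):
        """Rehash leaf i, then propagate upward: one rehash per level."""
        self.levels[0][i] = md5(new_block)
        for d in range(1, len(self.levels)):
            i //= 2
            kids = self.levels[d - 1][2 * i:2 * i + 2]
            self.levels[d][i] = md5(kids[0] + kids[1]) if len(kids) == 2 else kids[0]
```

After `replace_block`, the root matches what a fresh tree built over the modified blocks would produce, so it retains the property the answer calls out: any change to any block drastically changes the final hash, at the cost of keeping roughly 2n/b hashes in memory.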