Hello everyone,
This is my first status report regarding my GSoC project, Implementing consensus diffs. We weekly meetings with Nick and Sebastian, my mentors, every Wednesday at 14h UTC, if you want to take a closer look at the project's progress or want to participate.
In the application timeline I specified that I would spend the first weeks coding the server side part, which is mostly just the generation of diffs. And that's what I've been doing thus far. I'm a bit behind on schedule due to exams, but I'm not too far off.
I first implemented a dynamic algorithm, quadratic in both time and space, which seemed to work fine. Of course it didn't work for consensuses, since that would mean an allocation of nearly 3GB. So the next thing I did was switching to Hirschberg's algorithm, which still takes quadratic time but only takes linear space due to a divide-and-conquer approach.
But that still leaves us at a horrible 16 seconds of runtime to generate a regular microdescriptor consensus diff, whereas 'diff -d' only takes a fifth of a second. That is largely due to the fact that we're not taking advantage of the consensus properties yet.
So what I will do next is improve the algorithm, which should make it very close to linear in time. The idea is to navigate the two consensuses, looking at the digests that will already be sorted for us. That's part of the 'r' line for descriptor consensuses, and the 'm' line for microdescriptor consensuses. This should take linear time.
Now, the part of Hirschberg's algorithm that makes it quadratic in time is finding 'k', the column at which the problem can be split in half wihtout losing precision. This is well explained here: http://wordaligned.org/articles/longest-common-subsequence
But once we are navigating the nodes ordered by digests, we can avoid having to calculate 'k'. Given that we are at any pair of matching digests, we can assume that this is already a point 'k' because all the nodes that are before the digest on the first consensus can only be found before the digest on the second consensus, and the same goes for later digests.
So the final algorithm should travel both consensuses in linear time, and use our quadratic time algorithm on the small chunks between these matching digests. The quadratic time algorithm would at worst be used for maybe a couple hundred lines, a size at which it still performs fairly well.
You may be wondering why I went with an O(MN) algorithm when I could have gone with the more common O(ND) algorithm, which I understand is the one used in the GNU diffutils. The main reason is the use-case. O(ND) performs worse than O(MN) when the differences are large, because then D=2N. In our case, when we are running the algorithm it's because nodes disappeared, appeared or simply to find changes in nodes that are still there.
So it will be very common that we find diffs D that are very close in size to N or M. Plus, as far as I understand, the divide-and-conquer algorithm is pretty clean and easy to understand while the O(ND) algorithm by Myers would probably be harder to implement in a clean way.
I'm in no way an expert in diff/lcs algorithms, so suggestions and comments are very welcome.