thandy repository mirrorability

Peter Palfrader peter at
Wed Aug 4 16:03:42 UTC 2010


I've been wondering about mirrorability for when we start using thandy
for real.  That is, what happens when a user accesses a mirror that is
in the process of updating its files.

Let me describe first how Debian does things, so we have a nice example
of the problem:

In your normal Debian archive tree we have normal files (.deb and
.tar.gz and all that);  in Debian they all live below pool/.

We have a list of various subsets of them with their corresponding
digests in files called Sources and Packages (one per
arch/suite/component, but that's not really that important).  And we have
a list of all those metafiles with their hashes in a file called
Release; the PGP signature of that file sits next to it and is named
Release.gpg.  All these files live under dists/.
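
As a rough illustration of that chain of digests, here is a minimal
Python sketch.  The "digest  name" line format, the helper names and the
example paths are assumptions made for this example only; the real
Packages and Release files use a different, richer format, and the
signature over Release is left out entirely.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_index(files: dict[str, bytes]) -> bytes:
    """Serialize one 'digest  name' line per file, sorted by name."""
    lines = [f"{sha256_hex(content)}  {name}"
             for name, content in sorted(files.items())]
    return "\n".join(lines).encode()


def parse_index(index: bytes) -> dict[str, str]:
    """Return {name: digest} from the format written by build_index."""
    entries = {}
    for line in index.decode().splitlines():
        digest, name = line.split("  ", 1)
        entries[name] = digest
    return entries


# Toy archive: one file in the pool, one Packages index listing it,
# and one Release file listing the Packages index.
pool = {"pool/main/h/hello/hello_1.0_amd64.deb": b"fake package contents"}
packages = build_index(pool)
release = build_index({"main/binary-amd64/Packages": packages})
```

Each level only vouches for the exact bytes of the level below it, which
is why a client that mixes files from two different archive states ends
up with digests that no longer match.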

When a user wants to update their system or install a new package they
first run 'apt-get update', which fetches all the metadata files
(Packages, Release, Release.gpg), and later they run 'apt-get install' or
'apt-get upgrade' (or ..), which gets the .deb.

This would all be fine if the archive were static almost all of the
time.  Of course it isn't, and that introduces two problems when
updating and especially when mirroring the archive:

- The Sources/Packages file might refer to files that are not yet or
  no longer on the server.
- The various checksums might be out of sync:
  + the signature file might not go with the Release file a client
    downloaded
  + the Release file might reference a Packages/Sources file with a
    different hash
  + Packages/Sources files referencing files with the wrong digest
    does not happen in Debian because files below pool/ never change;
    they get added and removed but they never get modified once
    added.
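
The failure modes in this list can be sketched as a check over what a
client actually received from a mirror.  This is only an illustration
under simplifying assumptions: the function name find_desync and the
dictionary-based inputs are hypothetical, the signature check is not
modelled, and digests of pool files are not compared because those files
never change.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_desync(release: dict[str, str],
                metafiles: dict[str, bytes],
                packages: dict[str, str],
                pool: dict[str, bytes]) -> list[str]:
    """Report inconsistencies in one client's view of a mirror.

    release:   {metafile name: digest} as listed in Release
    metafiles: {metafile name: bytes} as served by the mirror
    packages:  {pool path: digest} as listed in Packages/Sources
    pool:      {pool path: bytes} as served by the mirror
    """
    problems = []
    for name, digest in release.items():
        if name not in metafiles:
            problems.append(f"missing metafile: {name}")
        elif sha256_hex(metafiles[name]) != digest:
            problems.append(f"hash mismatch: {name}")
    for path in packages:
        # Only presence is checked here; pool files are never modified.
        if path not in pool:
            problems.append(f"missing file: {path}")
    return problems
```

A mirror caught mid-update would typically show up here as a Release
that still lists the old Packages digest, or as a Packages file that
names a pool file the mirror has not fetched yet.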

Debian mirroring solves the problem partially by doing the mirroring
process in two stages (and no user accesses the master repository, so
everything a user sees is always a mirror).

The first stage just fetches new files below pool/.[1]  It does not
ever delete anything.

The second stage then updates the rest of the archive, using rsync's
--delay-updates option which means rsync first gets a new version of all
the files that changed (or are new), stores them under some temporary
name and once it has all data locally it walks over the tree and renames
stuff into place.  We kinda pretend that this rename step is atomic.
The second stage also runs with --delete and --delete-after, removing
all the files that are no longer on the master at the end of the run.
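
The two stages described above could be driven roughly like this.  It is
a sketch only, not the script Debian mirrors actually run: the function
names, the --recursive/--times options and the example source and target
in the usage note are assumptions; --delay-updates, --delete and
--delete-after are the rsync options mentioned above.

```python
import subprocess


def stage_commands(source: str, target: str) -> list[list[str]]:
    """Build the rsync command lines for a two-stage mirror run."""
    src = source.rstrip("/")
    dst = target.rstrip("/")
    # Stage 1: only fetch new files below pool/, never delete anything.
    stage1 = ["rsync", "--recursive", "--times",
              f"{src}/pool/", f"{dst}/pool/"]
    # Stage 2: update everything else, renaming changed files into place
    # at the end of the run and deleting stale files after the transfer.
    stage2 = ["rsync", "--recursive", "--times",
              "--delay-updates", "--delete", "--delete-after",
              f"{src}/", f"{dst}/"]
    return [stage1, stage2]


def run_mirror(source: str, target: str) -> None:
    """Run both stages in order, stopping if either rsync call fails."""
    for argv in stage_commands(source, target):
        subprocess.run(argv, check=True)
```

A call such as run_mirror("rsync://master.example.org/debian",
"/srv/mirror/debian") would run both stages; stopping on the first
failure means the metadata is never updated when the pool sync did not
complete.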

In addition to this two stage mirroring process the master archive does
not remove files immediately after they stop being referenced but waits
for a couple of days before they get deleted.
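
A minimal sketch of such a grace period follows.  The function name, the
bookkeeping of "unreferenced since" times and the default of two days
are assumptions for the example; the text above only says "a couple of
days".

```python
from datetime import datetime, timedelta


def files_to_delete(unreferenced_since: dict[str, datetime],
                    now: datetime,
                    grace: timedelta = timedelta(days=2)) -> list[str]:
    """Return paths that have been unreferenced for at least `grace`."""
    return sorted(path for path, since in unreferenced_since.items()
                  if now - since >= grace)
```

Keeping old files around for longer than a mirror run takes means a
client holding slightly stale metadata can still fetch the files that
metadata names.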

In theory that's quite simple, but it does have its pitfalls.  For one,
everybody who wants to mirror the archive has to run a special
mirroring script (and people don't really like doing that).
Secondly, it becomes really painful when you have a rotation that
consists of many machines but should present the same view of data to
users regardless of which individual machine they get their data
from.  I.e. you have to do some form of synchronized staged mirroring.

So, from a quick look at thandy and without knowing much about it, it
appears as if thandy will suffer from many of the same problems.  The
timestamp.txt file looks like one that's particularly problematic.  Is
this correct, or is there some clever scheme that avoids the desync
problems while a mirror update is in progress?

Should we worry about this and try to see if we can come up with some
clever schemes that mitigate or avoid the issue?


1. Not exactly true, but close enough for this discussion.
                           |  .''`.  ** Debian GNU/Linux **
      Peter Palfrader      | : :' :      The  universal
                           | `. `'      Operating System
                           |   `-
