Anonymity-preserving collection of usage data of a hidden service authoritative directory

Sat May 5 10:10:08 UTC 2007

Hi,

> They also republish whenever they consider their descriptor to be 
> "dirty", which happens when they establish a new introduction point 
> (rend_service_intro_established()) or give up on and drop an
> introduction point (rend_services_introduce()). This 'dirty' part is
> what I meant when I was pondering if a few hidden services have
> unstable connections, and thus change their intro points a lot.

Maybe it's just a personal feeling (I have not measured it yet), 
but don't you think that introduction points change quite often? I 
always thought that RSDs are republished so often because the set of 
IPos is unlikely to stay the same for more than one hour anyway. If 
so, an RSD that is 23 hours old will hardly contain any working IPos 
any more.

>> So it's very unlikely that there will be
>> many novel publications after the shown intervals.
> 
> Yep. They will be people creating a new hidden service, or people 
> turning on their Tor after it's been off for a while. As we see
> above, there are at most a handful in each 15 minute period.

However, novel publications decreased from 3.72 to 0.81 on average when 
comparing the two statistics. Maybe this comes from hidden services that 
were offline for less than 3 days and then "republish" their 
descriptor. That would be an artifact of the 3-days-rule, because such 
a publication is really a novel publication with novel IPos rather 
than a republication.
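
To illustrate the suspected artifact, here is a minimal Python sketch. 
It assumes that the statistics count a publication as "novel" only if 
the directory holds no descriptor for that service within the 3-day 
retention window. The function name, the dictionary layout, and the 
classification rule itself are assumptions for illustration, not the 
actual measurement code.

```python
from datetime import datetime, timedelta

# The "3-days-rule" from the discussion: descriptors are kept for 3 days.
RETENTION = timedelta(days=3)


def classify_publication(stored, service_id, now):
    """Classify an incoming descriptor as 'novel' or 'republication'.

    `stored` maps a service ID to the time its last descriptor arrived.
    Hypothetical rule: a publication is a republication whenever an
    unexpired entry for the same service is still stored.
    """
    last_seen = stored.get(service_id)
    if last_seen is None or now - last_seen > RETENTION:
        kind = "novel"
    else:
        kind = "republication"
    stored[service_id] = now  # remember the latest publication time
    return kind


stored = {}
t0 = datetime(2007, 5, 1, 12, 0)
first = classify_publication(stored, "svc", t0)
# The service was offline for 2 days, yet it still counts as a republication.
after_two_days = classify_publication(stored, "svc", t0 + timedelta(days=2))
# After more than 3 days of silence the entry has expired.
after_long_gap = classify_publication(stored, "svc", t0 + timedelta(days=6))
```

Under this assumed rule, a service that comes back after 2 days with a 
fresh set of IPos is still counted as a republication, which is the 
effect described above.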

> But hey, at least we remove old ones sometime, rather than just
> collecting them forever. :)

In German we call people who keep everything because they don't dare to 
throw anything away "Messies"... But maybe we can "heal" that in Tor. ;)

> (Remember that this same logic is used by *clients* to discard old
> service descriptors, and we have many fewer guarantees that their
> clocks are at all correct. That's what the MAX_SKEW business is
> about.)

>> Why would a client
>> expect that a hidden service with a 23-hour old descriptor is
>> online if it knows that it should have republished every hour?
> 
> Well, if the client's clock is wrong by 23 hours, ...

> But you're right, the servers storing the descriptors should be
> assumed to have better clocks, and they could just dump old ones to
> save clients the trouble.

I am not sure whether I understand your arguments about clock skew 
correctly. Doesn't clock skew only matter between two *different* 
clocks, e.g. a client's and a directory node's clock? In that case I 
agree that there should be some tolerance.

But when a directory node receives an RSD, it can note the time of 
receipt and discard the RSD 1.5 hours later using its own clock. 
Regardless of any client's clock, the descriptor is then 1.5 hours old 
and -- possibly -- useless. The latter depends on how often IPos 
change. (I think this would be the next thing to measure...)
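
The idea of expiring descriptors by the directory's own clock can be 
sketched as follows. This is only an illustration in Python, not Tor 
code: the class `DescriptorStore`, its methods, and the injectable 
`clock` parameter are hypothetical names, and the 1.5-hour lease is the 
value proposed above.

```python
import time

LEASE_SECONDS = 90 * 60  # proposed lease: 1.5 hours


class DescriptorStore:
    """Keeps descriptors and expires them by local receive time only."""

    def __init__(self, clock=time.monotonic):
        # Only this node's clock is consulted, so timestamps inside the
        # descriptor and the client's clock play no role in expiry.
        self._clock = clock
        self._entries = {}  # service_id -> (descriptor, received_at)

    def publish(self, service_id, descriptor):
        """Store a descriptor together with the local time of receipt."""
        self._entries[service_id] = (descriptor, self._clock())

    def fetch(self, service_id):
        """Return the descriptor, or None if it is unknown or expired."""
        entry = self._entries.get(service_id)
        if entry is None:
            return None
        descriptor, received_at = entry
        if self._clock() - received_at > LEASE_SECONDS:
            del self._entries[service_id]  # lease ran out: drop it
            return None
        return descriptor
```

Because both timestamps come from the same clock, no skew tolerance is 
needed for this check. Tolerance only becomes relevant when a client 
compares a timestamp inside the descriptor against its own clock.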

> Of course, the real reason hidden services republish every hour is 
> because the directory authorities don't store anything to disk and
> don't share service descriptors among each other -- so every time we
> restart a directory authority it forgets about all hidden services.
> This means they need to republish frequently just in case an
> authority restarts. If we made some way for service descriptors to
> survive a restart (e.g. by storing them to disk, replicating them, or
> both), then it seems to me we would reduce the need to republish
> dramatically.

The question is whether it is more likely that a directory node restarts 
or that an introduction point changes.

>> In a decentralized design I suggest to cut down the lease time to
>> one hour (or maybe 1.5 hours). This saves resources for replicating
>> descriptors in case of leaving/joining routers.
> 
> This is an interesting tradeoff. I'm not sure if it's better to
> demand frequent "I'm still here" messages from the hidden services,
> so you can quickly drop the ones that don't send one, or to be more
> flexible and let them go long periods with the same intro points and
> never need to send an update.

Maybe 1 hour is too short. 4 hours? 12 hours? We can negotiate that. ;) 
No, seriously: How long do you think a set of introduction points 
stays the same -- after a stabilization phase of, say, 15 minutes 
after starting the service?

> I guess if we want to get extra complex then somebody could try
> connecting to the hidden service and only dump the descriptor if it's
> unreachable -- but that probably doesn't play well with our
> authentication or authorization tricks, nor with the valet node and
> related designs.

Maybe we can postpone this extension? My first thought as an attacker 
would be to register 1000 fake hidden services at one directory node 
and wait for it to establish 1000 connections to them. :(

> Actually, three. Only "v1" directory authorities handle hidden
> service stuff, and that's just moria1, moria2, and tor26 right now.

Whoops. Yes, you wrote that in an earlier mail that I had not read 
completely before writing my last mail...

> Yep. This number seems to represent the total count of people
> interacting with a given hidden service, but remember that it doesn't
> represent the total number of rendezvous attempts -- since clients
> cache the descriptors.

Sure.

> Though note in connection_ap_handshake_rewrite_and_attach() that
> clients try to refetch a newer descriptor if the one they have cached
> is more than 15 minutes old. Are you following all the details so
> far? :)

Now that you ask... :) Why 15 minutes? So, clients consider RSDs to be 
old after 15 minutes, servers after 60 minutes, but directories keep 
them for 3 days?...
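
To make the mismatch concrete, the three thresholds mentioned in this 
thread can be put side by side. The constant and function names below 
are made up for illustration; only the three durations come from the 
discussion.

```python
from datetime import timedelta

# Durations taken from the thread; the names are hypothetical.
CLIENT_REFETCH_AFTER = timedelta(minutes=15)  # client tries to refetch
SERVICE_REPUBLISH_EVERY = timedelta(hours=1)  # service republishes
DIRECTORY_KEEPS_FOR = timedelta(days=3)       # directory retention


def views_of_age(age):
    """Report how each party would judge a descriptor of the given age."""
    return {
        "client_wants_refetch": age > CLIENT_REFETCH_AFTER,
        "service_should_have_republished": age > SERVICE_REPUBLISH_EVERY,
        "directory_still_serves": age <= DIRECTORY_KEEPS_FOR,
    }


# A 23-hour-old descriptor: stale for client and service, yet still served.
example = views_of_age(timedelta(hours=23))
```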

> There is something that is making the
> rendezvous itself be very slow. I'm not sure what it is. There's no
> need for it to be as slow as it is. And I think it really reduces the
> set of people who think hidden services are neat.

Then this might be one of the next things to investigate. My guess is 
that some timeout is too long, or that some operations could/should be 
performed two or three times in parallel.

> Also, scaling questions aside, there are other reasons to distribute 
> hidden service descriptors and improve their availability.

Right.

> So what more data might we want to collect about current usage
> patterns? Or is this enough to move on to the next steps which are to
> think about an ascii format for descriptors (rather than the awful
> binary format I was dumb enough to use back when we started), think
> about the implications of letting strangers see and serve all the
> descriptors, and think about a protocol for receiving, serving, and
> replicating descriptors?

These are the possible next tasks (arbitrary order):
- Find out why connection establishment is that damn slow.
- Measure how often IPos change.
- Think about the format of RSDs (ASCII vs. binary), encryption of 
contents and the related security implications.
- Describe the protocol to receive/serve/replicate RSDs.

But enough measurements for the moment. I should work on some concepts 
now and will therefore start with the RSD format.

--Karsten