Stream Reasons and "suspects" vs "actual" failures

Mon Nov 13 20:58:39 UTC 2006

On Wed, Nov 08, 2006 at 03:42:08AM -0600, Mike Perry wrote:
> Ok, I've written all the infrastructure into my scanner to separate
> circuit and stream failures into "suspects" and "actual" failure
> points (along with some other nifty updates for content scanning). 
> 
> The idea, as I mentioned in a previous post, is to have the "actual"
> list be created on the assumption that there are no malicious nodes
> and only count stats for nodes that (under this assumption) we are
> sure caused the failure, where as the "suspects" list will equally
> blame everyone who could have possibly caused a failure, maliciously
> or even just by bug.

I wonder about these terms from a usability POV.  The word "suspect"
kinda implies that the server is likely to be ill-behaved
intentionally because of its association with criminal investigations;
"actual" implies a degree of certainty.  Are there other terms that
would more closely describe what you're actually measuring?

> For circuits it works like this: For the "actual" list, when a circuit
> fails to extend, it must have happened because the node it attempted
> to extend to is messed up.  However, the "suspects" list blames
> everyone involved so far in that circuit, on the assumption that any
> of them could have caused the failure (either maliciously or perhaps
> just due to some weird cell lossage).

Just so you know, this approach is somewhat problematic.  One of the
big problems in reputation systems for anonymity nets is that not only
is it hard to track who is responsible for failures, but also that a
clever adversary who knows what approach you're using can manipulate
you into blaming the wrong person.  For 

Roger wrote a couple of good papers about this (one with Mike Freedman,
David Hopwood, and David Molnar; one with Paul Syverson).  They assume
a high-latency mixnet rather than an onion routing system, but a lot
of the analysis is still applicable.

   http://freehaven.net/doc/mix-acc/mix-acc.pdf
   http://freehaven.net/doc/casc-rep/casc-rep.pdf

The "creeping death" attack in the latter paper is particularly
worrisome.

On the whole, I think the best you can do is try to collect
fine-grained stats, and not get too fancy about how you aggregate
them.  For instance, if a disproportionate number of attempts to
extend *from* A or *to* B fail, either one is interesting.
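
A minimal sketch of what "fine-grained, not fancy" could look like,
assuming the scanner sees each extend attempt as a (from, to, success)
triple.  The class and method names are hypothetical and not taken from
any existing scanner code.  It keeps separate attempt/failure counters
for the "extend from" and "extend to" roles and does no blame
aggregation at all.

```python
from collections import defaultdict


class ExtendStats:
    """Per-node counters for circuit extend attempts (hypothetical helper)."""

    def __init__(self):
        # node name -> [attempts, failures], kept separately per role
        self.extend_from = defaultdict(lambda: [0, 0])
        self.extend_to = defaultdict(lambda: [0, 0])

    def record(self, from_node, to_node, ok):
        """Count one attempt to extend a circuit from from_node to to_node."""
        for table, node in ((self.extend_from, from_node),
                            (self.extend_to, to_node)):
            table[node][0] += 1
            if not ok:
                table[node][1] += 1

    @staticmethod
    def failure_rate(table, node):
        """Fraction of recorded attempts involving node that failed."""
        attempts, failures = table.get(node, (0, 0))
        return failures / attempts if attempts else 0.0


stats = ExtendStats()
stats.record("A", "B", ok=False)
stats.record("A", "C", ok=False)
stats.record("D", "B", ok=True)

rate_from_a = ExtendStats.failure_rate(stats.extend_from, "A")  # 2 of 2 failed
rate_to_b = ExtendStats.failure_rate(stats.extend_to, "B")      # 1 of 2 failed
```

Either rate standing out against the network-wide average would make
the node worth a closer look, without deciding up front who is to blame.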

> I'm now trying to decide which stream reasons I should blame on the
> exit versus which I should blame on every node in the circuit. The
> source is kind of hard to follow w/ this.. at a guess I'm thinking
> that exit-specific reasons are everything except: HIBERNATING, MISC,
> TIMEOUT, TORPROTOCOL, DESTROY, and DONE (no error?). Any others?

First, I'd track the reasons independently.  One of the big
discoveries you've made so far is that it's way more common for
screwed-up servers to be broken than to be malicious, and having
more information here will always be useful in debugging mess-ups.

As for the codes, all *remote* stream end codes are generated by the
exit node, so if the exit node is lying, you can't believe any of
them.  But if the exit node is trying to be honest, you can interpret
remote reasons like this:

MISC doesn't tell us much; it might be the exit node's fault; it might
be old code.

RESOLVEFAILED is probably either a nonexistent destination or a bad
DNS server at the exit.

CONNECTREFUSED is either a bad network connection at the exit, a bad
exit, or a bad/overloaded destination.

EXITPOLICY is what it says: you tried to make the exit node do
something its policy didn't support.

DESTROY shouldn't occur as a remote reason unless the exit node itself
tries to tear down the circuit, and maybe not even then.

DONE isn't an error, unless it's spuriously reported.  In that case,
it's either a bad network connection at the exit, a bad exit, or a
bad/overloaded destination.

TIMEOUT is either a bad network connection at the exit, a bad
exit, or a bad/overloaded destination.

HIBERNATING is not an error; it's a sleepy exit.

INTERNAL probably means that the exit node crashed or ran out of
resources in an unexpected way or hit a bug.

RESOURCELIMIT is definitely the exit node running out of resources.

CONNRESET could be the exit node's fault, or the website's fault, or
the fault of the connection between the exit node and the website.

TORPROTOCOL should never happen; streams only end because of TORPROTOCOL
when somebody said something that didn't conform to the Tor protocol.
This could be a bug in somebody's code, or a version incompatibility.
It is unlikely to be an attack.

NOTDIRECTORY should never happen unless the client has an out-of-date
version of the server's descriptor.

(It's a little more complicated for non-remote reasons.  I'll have to
look at the code more for that.)
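
The list above can be written down as a lookup table.  This is only a
sketch under the same assumption as the list (the exit is trying to be
honest); the cause labels and the function name are made up for
illustration, and the table covers remote reasons only.

```python
# Hypothetical cause labels, chosen for this sketch only.
EXIT = "exit"                  # the exit node itself (bug, crash, bad DNS)
EXIT_NET = "exit-network"      # the exit's network connection
DEST = "destination"           # nonexistent, bad, or overloaded destination
CLIENT = "client"              # our own request or stale information
UNKNOWN = "unknown"            # reason carries too little information

# Possible causes per *remote* stream end reason, assuming an honest exit.
# An empty set means the reason is not an error when reported correctly.
REMOTE_REASON_CAUSES = {
    "MISC": {EXIT, UNKNOWN},
    "RESOLVEFAILED": {DEST, EXIT},
    "CONNECTREFUSED": {EXIT_NET, EXIT, DEST},
    "EXITPOLICY": {CLIENT},
    "DESTROY": {EXIT},             # should not normally occur remotely
    "DONE": set(),                 # if spurious, treat like TIMEOUT
    "TIMEOUT": {EXIT_NET, EXIT, DEST},
    "HIBERNATING": set(),          # a sleepy exit, not a failure
    "INTERNAL": {EXIT},
    "RESOURCELIMIT": {EXIT},
    "CONNRESET": {EXIT, DEST, EXIT_NET},
    "TORPROTOCOL": {UNKNOWN},      # a bug or version mismatch somewhere
    "NOTDIRECTORY": {CLIENT},      # out-of-date server descriptor
}


def possible_causes(reason):
    """Return the set of possible causes for a remote stream end reason."""
    return set(REMOTE_REASON_CAUSES.get(reason, {UNKNOWN}))


def exit_only(reason):
    """True if an honest exit reporting this reason is the only candidate."""
    return possible_causes(reason) == {EXIT}
```

Keeping the raw reason alongside any derived label preserves the
information needed for debugging later.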

> Conversely, are there any exceptions for the "suspects" list where we
> can say for sure that a specific node is at fault no matter what for a
> particular failure reason, for either circuits or streams?

For streams, since remote reasons only come from the exit node, you
can be sure in the case where the exit node says, "closing, my fault."
But if it says, "closing, not my fault", there's no way to be sure.
For circuit reasons, there's no way to be certain unless you're at
the first hop: any DESTROY reason that could make you suspect one node
on your circuit could have been forged by an earlier node on the
circuit (or perhaps caused by a later node).
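
A rough sketch of those two rules, with loud assumptions: the set of
"my fault" reasons (INTERNAL and RESOURCELIMIT) is my reading of the
reason list, not something defined by Tor, and "at the first hop" is
interpreted here as a failure observed on the hop we connect to
directly.  The function names are hypothetical.

```python
# Assumption: remote reasons in which the exit itself admits fault.
SELF_BLAMING_REASONS = frozenset({"INTERNAL", "RESOURCELIMIT"})


def certain_stream_culprit(path, remote_reason):
    """Return the exit node if its own remote reason admits fault.

    path is the ordered list of nodes in the circuit; the last entry is
    the exit.  Any other reason gives no certainty, so return None.
    """
    if path and remote_reason in SELF_BLAMING_REASONS:
        return path[-1]
    return None


def certain_circuit_culprit(path, failed_hop_index):
    """Return the first hop if the failure was seen there, else None.

    For later hops, a DESTROY could have been forged by an earlier node
    or caused by a later one, so no single node can be blamed for sure.
    """
    if path and failed_hop_index == 0:
        return path[0]
    return None


path = ["guard", "middle", "exit"]
stream_certain = certain_stream_culprit(path, "RESOURCELIMIT")
stream_unsure = certain_stream_culprit(path, "CONNRESET")
circuit_unsure = certain_circuit_culprit(path, 2)
```

In every other case the honest answer is None, and the failure belongs
in the per-node counters rather than on a single node.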

hth,
-- 
Nick Mathewson