[tor-relays] clarification on what Utah State University exit relays store ("360 gigs of log files")

Mike Perry mikeperry at torproject.org
Fri Aug 14 00:47:23 UTC 2015

Sharif Olorin:
> > At what resolution is this type of netflow data typically captured?
> For raw capture, timestamps are typically second-resolution. The
> resolution post-aggregation is a different question. Keep in mind that
> netflow is just the most common example; many networks don't use Cisco
> netflow, but have something that meets the same requirements,
> storing relatively more or less data (e.g., pmacct, bro).
> > Are we talking about all connection 5-tuples, bidirectional/total
> > transfer byte totals, and open and close timestamps, or more (or less)
> > detail than this?
> That's about right; some systems (e.g., pmacct in some configurations)
> store a four-tuple of (src,dest,tx,rx) while throwing out the
> ports and aggregating over the tx and rx flows such that connections
> can no longer be uniquely identified. What's stored from Cisco netflow
> is quite flexible[0]. Other systems like bro default to storing
> one record per connection, with all the information in a five-tuple
> plus things like IP TOS and byte counts.
> > Are timestamps always included?
> Yes, to some granularity (there's not much point in storing connection info
> without times, for any of the reasons people normally store connection
> info). The most recent system I set up (bro) records connections with
> second-precision timestamps; the one before that (pmacct) stored
> aggregates over ten seconds (src,dest,tx,rx).
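
To make the aggregated model quoted above concrete, here is a minimal Python sketch of collapsing per-connection records into ten-second (src,dest,tx,rx) aggregates. It is not taken from pmacct or bro. The record layout, the sample addresses, and the reading of tx/rx as byte counts are my own assumptions for illustration.

```python
from collections import defaultdict

BUCKET_SECONDS = 10  # aggregation window from the quoted description


def aggregate(records, bucket_seconds=BUCKET_SECONDS):
    """Collapse per-connection records into (bucket, src, dst) -> (tx, rx).

    Each input record is a hypothetical tuple:
    (timestamp, src, sport, dst, dport, tx_bytes, rx_bytes).
    Ports are discarded, so distinct connections between the same
    two hosts in the same window become indistinguishable.
    """
    totals = defaultdict(lambda: [0, 0])
    for ts, src, _sport, dst, _dport, tx, rx in records:
        bucket = ts - (ts % bucket_seconds)  # start of the time window
        entry = totals[(bucket, src, dst)]
        entry[0] += tx
        entry[1] += rx
    return {key: tuple(value) for key, value in totals.items()}


# Made-up sample data: two connections (different source ports)
# between the same pair of hosts.
records = [
    (100, "10.0.0.1", 40001, "192.0.2.7", 443, 500, 1500),
    (103, "10.0.0.1", 40002, "192.0.2.7", 443, 200, 800),
    (112, "10.0.0.1", 40001, "192.0.2.7", 443, 100, 300),
]
aggregated = aggregate(records)
```

The first two records fall into the same window and merge into one entry, which is the sense in which connections can no longer be uniquely identified.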

So in the bro-based system (which sounds higher resolution) the final
logged data was second-precision timestamps on full connection tuples?

So if I have a connection to a Tor Guard node held open for 8 hours, at the
end of the session, your system would record a single record with:

Or would it record 8*60*60 == 28800 records, with one record stored per
second that the connection was open/active?
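
The arithmetic behind that question, as a small Python sketch. The function names are made up for illustration, and the per-timeslice count is an upper bound that assumes some activity in every slice.

```python
def records_per_connection(duration_seconds):
    """Connection-level logging: one record regardless of duration."""
    return 1


def records_per_timeslice(duration_seconds, slice_seconds=1):
    """Timeslice logging: one record per slice with activity.

    Assumes the connection is active in every slice, so this is an
    upper bound. Uses ceiling division so a partial slice still counts.
    """
    return -(-duration_seconds // slice_seconds)


session = 8 * 60 * 60  # 8 hours in seconds

per_connection = records_per_connection(session)
per_second = records_per_timeslice(session, slice_seconds=1)
per_ten_seconds = records_per_timeslice(session, slice_seconds=10)
```

The gap between 1 record and tens of thousands is what determines how much timing information is available for correlation.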

> > I think for various reasons (including this one), we're going to
> > want some degree of padding traffic on the Tor network relatively
> > soon, and having more information about what is typically
> > recorded in these cases would be very useful to inform how we might want
> > to design padding and connection usage against this and other issues.
> arma or others can probably explain why this is a hard problem; I
> don't know enough in this area to comment.

I think any system that stores connection-level data (as opposed to
one record per timeslice of activity on a tuple) is likely to be rather
easy to defend against where correlation is concerned.

I also think that systems that store only sampled data will be very
easy to defend against correlation. Murdoch's seminal IX-analysis work
required 100-500M transfers to get any accuracy out of sample-based
correlation at all, and even then the false positives were a serious
problem, even when correlating a small number of connections.

We have a huge problem right now where all of the research in this area
claimed extremely high success rates, and swept any mitigating
factors under the rug (especially false positives and the effects of
large numbers of concurrent users or additional activity).

> > Information about how UDP is treated would also be useful if/when we
> > manage to switch to a UDP transport protocol, independent of any
> > padding.
> I don't think UDP helps you at all here. What makes you think it might?

Well, it seems harder to store a full connection tuple from open until
close, because you have no idea when the connection actually closed
(unless you are recording a tuple for every second during which there is
any activity, or similar).
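
As a rough illustration of the point: without a close event, a collector has to infer where a UDP "connection" ends, for example with an idle timeout. The Python sketch below is hypothetical. The function name and the 30-second timeout are assumptions, not the behaviour of any particular collector.

```python
def split_flows(packet_times, idle_timeout=30):
    """Group packet timestamps into (first_seen, last_seen) flows.

    A new flow starts whenever the gap since the previous packet
    exceeds idle_timeout seconds. The timeout value is an assumed
    example, not a default from any real system.
    """
    flows = []
    start = last = None
    for t in sorted(packet_times):
        if start is None:
            start = last = t          # first packet opens a flow
        elif t - last > idle_timeout:
            flows.append((start, last))  # gap too long: close old flow
            start = last = t
        else:
            last = t                  # still within the same flow
    if start is not None:
        flows.append((start, last))
    return flows


flows = split_flows([0, 5, 20, 100, 110], idle_timeout=30)
```

Here the inferred "close" times are just the last packet seen before a gap, so the recorded duration depends on the timeout chosen rather than on anything the endpoints did.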

Mike Perry