[tor-relays] clarification on what Utah State University exit relays store ("360 gigs of log files")

grarpamp grarpamp at gmail.com
Thu Aug 13 06:11:17 UTC 2015

On Wed, Aug 12, 2015 at 7:45 PM, Mike Perry <mikeperry at torproject.org> wrote:
> At what resolution is this type of netflow data typically captured?

Routers originally exported at 100% coverage. Many of them then
started supporting sampling at various rates (because routers were
choking and buggy anyway, and netheads were happy with averages), and
some only do sampling. Plug flow probes into network taps and you
can do whatever you want (netsec loves this and other tools).
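
To make the sampling point concrete, here is a minimal sketch of
systematic 1-in-N packet sampling with scaled-up byte estimates. The
function names and the (flow_id, length) packet tuples are made up for
illustration; real exporters differ in how they pick packets and
whether they scale at all.

```python
from collections import Counter

def sample_one_in_n(packets, n):
    """Keep every n-th packet (packet n, 2n, 3n, ...) of the stream."""
    return packets[n - 1::n]

def estimate_octets(sampled, n):
    """Scale sampled packet lengths by n to estimate per-flow byte totals."""
    totals = Counter()
    for flow_id, length in sampled:
        totals[flow_id] += length * n
    return dict(totals)

# Hypothetical traffic: one heavy flow plus a single-packet flow.
packets = [("big", 1000)] * 100
packets[4] = ("tiny", 100)

estimates = estimate_octets(sample_one_in_n(packets, 10), 10)
print(estimates)
```

The heavy flow gets a decent average, while a short flow that never
lands on a sampled packet leaves no record at all.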

> Are we talking about all connection 5-tuples, bidirectional/total
> transfer byte totals, and open and close timestamps, or more (or less)
> detail than this?
> Are timestamps always included? Are bidirectional transfer bytecounts
> always included? Are subsampled packet headers (or contents)
> sometimes/often included?
> What about UDP sessions? IPv6?
> Information about how UDP is treated would also be useful if/when we
> manage to switch to a UDP transport protocol, independent of any
> padding.

All of the above depends on which flow export version / aggregation you
choose, until you get to NetFlow v9 and IPFIX, where templates let you
define your own fields.
In short... yes.
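
As a rough illustration of "define your fields", here is a sketch of
how an IPFIX template record lists (information element ID, length)
pairs and how a matching data record is then just those fields packed
back to back. The element IDs are from the IANA IPFIX registry; the
particular field selection, the template ID value and the helper names
are assumptions for the example, and the message and set headers are
left out.

```python
import socket
import struct

TEMPLATE_ID = 256  # template IDs for data sets start at 256

# (IANA information element ID, length in octets); selection is illustrative.
FIELDS = [
    (8, 4),    # sourceIPv4Address
    (12, 4),   # destinationIPv4Address
    (7, 2),    # sourceTransportPort
    (11, 2),   # destinationTransportPort
    (4, 1),    # protocolIdentifier
    (1, 8),    # octetDeltaCount
    (152, 8),  # flowStartMilliseconds
    (153, 8),  # flowEndMilliseconds
]

def encode_template_record(template_id, fields):
    """Template record: template ID, field count, then one specifier per field."""
    out = struct.pack("!HH", template_id, len(fields))
    for ie_id, length in fields:
        out += struct.pack("!HH", ie_id, length)
    return out

def encode_data_record(src, dst, sport, dport, proto, octets, start_ms, end_ms):
    """Data record: field values in template order, network byte order."""
    return (socket.inet_aton(src) + socket.inet_aton(dst)
            + struct.pack("!HHBQQQ", sport, dport, proto,
                          octets, start_ms, end_ms))

template = encode_template_record(TEMPLATE_ID, FIELDS)
record = encode_data_record("192.0.2.1", "198.51.100.7", 40000, 443, 6,
                            3072, 1439446277000, 1439446279500)
```

Whether timestamps, byte counts, or anything else show up in a record
is then simply a matter of what the operator put in the template.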

Flow end time is the last matching packet seen, but one flow can span
several records when the mandatory expiry timers hit (they limit time and
therefore space, i.e. RAM). UDP flows get expired via those timers, TCP
usually via flags (FIN/RST). Conversely, one record can span several flows
when no other semantic key exists to tell them apart, as is often the case
with UDP.
DPI can also be used in the exporter to do all sorts of fun stuff and enable
other downstream uses (obviously TLS / IPsec / crypto break some things there).
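
The expiry behaviour can be sketched roughly as follows: a toy flow
cache keyed on the 5-tuple, with an inactive timer, an active timer,
and TCP FIN/RST handling. The class names and the timeout values are
assumptions for illustration, not any particular vendor's defaults.

```python
from dataclasses import dataclass

INACTIVE_TIMEOUT = 15.0  # seconds without packets (assumed value)
ACTIVE_TIMEOUT = 60.0    # max seconds per record (assumed value)
TCP_FIN, TCP_RST = 0x01, 0x04

@dataclass
class FlowRecord:
    key: tuple   # (src, dst, sport, dport, proto)
    start: float
    end: float   # timestamp of the last matching packet seen
    packets: int = 0
    octets: int = 0

class FlowCache:
    def __init__(self, inactive=INACTIVE_TIMEOUT, active=ACTIVE_TIMEOUT):
        self.inactive = inactive
        self.active = active
        self.flows = {}
        self.exported = []

    def _export(self, key):
        self.exported.append(self.flows.pop(key))

    def expire(self, now):
        """Export records that have been idle or open for too long."""
        for key, rec in list(self.flows.items()):
            if now - rec.end >= self.inactive or now - rec.start >= self.active:
                self._export(key)

    def observe(self, ts, key, length, tcp_flags=0):
        """Account one packet; TCP FIN/RST ends the record immediately."""
        self.expire(ts)
        rec = self.flows.get(key)
        if rec is None:
            rec = self.flows[key] = FlowRecord(key, ts, ts)
        rec.end = ts
        rec.packets += 1
        rec.octets += length
        if key[4] == 6 and tcp_flags & (TCP_FIN | TCP_RST):
            self._export(key)

# One long TCP connection, a packet every 10 s, closed with FIN.
cache = FlowCache()
tcp = ("10.0.0.1", "10.0.0.2", 40000, 443, 6)
for t in range(0, 130, 10):
    cache.observe(float(t), tcp, 500)
cache.observe(130.0, tcp, 500, tcp_flags=TCP_FIN)
print([(r.start, r.end, r.packets) for r in cache.exported])
```

With these assumed timers the single connection is cut into several
records by the active timer, while a UDP "flow" has no flags and only
ends when it goes quiet for longer than the inactive timer.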

Tor already bundles multiple logical flows (only TCP for users today) into
some number of physical TCP flows, so a UDP transport there might not need
anything special.
But consider looking at average flow lifetimes on the internet. There may
be a case for going longer, bundling, or turfing across a range of ports to
falsely trigger records / bloat, packet switching and so forth.

> and having more information about what is typically
> recorded in these cases would be very useful to inform how we might want
> to design padding and connection usage against this and other issues.

"Typical" is really defined by the use case of whoever needs the flows,
be it provisioning, engineering, security, operations, billing, bigdata, etc.
And only limited by the available formats, storage, postprocessing,
and customization. IPFIX and


> I think for various reasons (including this one), we're soon going to
> want some degree of padding traffic on the Tor network at some point
> relatively soon

Really? I can haz cake nao? Or only after I pump in this 3k email and
watch 3k come out the other side to someone otherwise idling ;)

... and/or some other bigdata systems ...
