tor's definition of 'median'

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 Hi, https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n2028
If 3 or more authorities provide a Measured= keyword for a router, the authorities produce a consensus containing a "w" Bandwidth= keyword equal to the median of the Measured= votes.
a random sample from recent votes: grep 37.59.38.117 -A 3 *|grep Measured w Bandwidth=6869 Measured=7570 w Bandwidth=6869 Measured=15500 w Bandwidth=6869 Measured=18100 w Bandwidth=6869 Measured=30500 Tor says the median value is 15500 2015-08-10-16-00-00-consensus: w Bandwidth=15500 but the median of these 4 values is actually: (18100+15500)/2 = 16800 no? Has tor a different definition of 'median' and simply takes always the second ordered measurement vote out of 4 votes or is there a bug in the spec or implementation? -----BEGIN PGP SIGNATURE----- iQIcBAEBCgAGBQJVyNtYAAoJEFv7XvVCELh0YU8P/0hWhCLfafvFDPdAfUwUJFPA A1ZA946JGKg1Vjy+ch3tffjDHRosXt7K5U33N9rKMUlW9ul2BQ+uNgzK4eTbghHz yCMn6D+uLk1xruYTsIUZ+Pk/ZywaUKj/FngohVvQnaJIgJCHCEnCIqqBNEK0PjUh EBzI5GdyWpEA2fh55PSRuSNCVzbiVhGwYSKgrVwFrFcKr9iPLmdZsa9SD1mb1PD1 IZOkU6TnSnFGbPaN+pHYdNr5/QJA1/08yYBpG7qAJaTCAMvMif7bKMJ7ElYo7opc oXcECIcBBPnATgKvbO47dbSnX3s+vPMsrRhhdTa1BMr1MIzVG3RWGhNSJwXXQTJh jyaPGj10JRWNxDsY0ro001MX0HymmXeLLCkY4nWsUqPXDiZcQe4oRzLQas5bOI94 ct4tCavj9pRNp2XYCWe631gqcsQ3xV7y37dWUCvpdXqt2NC1B7j2w/Y/UiuNArSj QBtS4Ap5wqnU5JySjndi+lIPOlaPk9uitmzpKLxNF9fnpI6ZECP3T50vHVeZiiQ9 7o1ZyBsPLk3bi2hRy8ZJ0wyO+5WF8bdrlsKXSR40UjrwQZ5kCQf6xmWiSg9PqZFU rIL14yCOsQkR3P4/IgMlL8dtgFk3Asmkkx039144fRKveEhffFjo54CUMhg/jLra EF5cUqz4QdgcpRvKa/5h =lA/6 -----END PGP SIGNATURE-----

On Mon, Aug 10, 2015 at 1:11 PM, nusenu <nusenu@openmailbox.org> wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512
Hi,
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n2028
If 3 or more authorities provide a Measured= keyword for a router, the authorities produce a consensus containing a "w" Bandwidth= keyword equal to the median of the Measured= votes.
a random sample from recent votes:
grep 37.59.38.117 -A 3 *|grep Measured w Bandwidth=6869 Measured=7570 w Bandwidth=6869 Measured=15500 w Bandwidth=6869 Measured=18100 w Bandwidth=6869 Measured=30500
Tor says the median value is 15500
2015-08-10-16-00-00-consensus: w Bandwidth=15500
but the median of these 4 values is actually: (18100+15500)/2 = 16800 no?
Has tor a different definition of 'median' and simply takes always the second ordered measurement vote out of 4 votes or is there a bug in the spec or implementation?
There's one misplaced throwaway sentence in dir-spec.txt: " All ties in computing medians are broken in favor of the smaller or earlier item. " We should bring this, and probably other things, into a "definitions" section earlier in dir-spec.txt. Patches welcome. ;) -- Nick

Is there some implementation-specific reason not to use the standard mathematical definition of "median"? If not, I propose changing the implementation to become it. -V On Tue, Aug 11, 2015 at 2:44 AM Nick Mathewson <nickm@alum.mit.edu> wrote:
On Mon, Aug 10, 2015 at 1:11 PM, nusenu <nusenu@openmailbox.org> wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512
Hi,
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n2028
If 3 or more authorities provide a Measured= keyword for a router, the authorities produce a consensus containing a "w" Bandwidth= keyword equal to the median of the Measured= votes.
a random sample from recent votes:
grep 37.59.38.117 -A 3 *|grep Measured w Bandwidth=6869 Measured=7570 w Bandwidth=6869 Measured=15500 w Bandwidth=6869 Measured=18100 w Bandwidth=6869 Measured=30500
Tor says the median value is 15500
2015-08-10-16-00-00-consensus: w Bandwidth=15500
but the median of these 4 values is actually: (18100+15500)/2 = 16800 no?
Has tor a different definition of 'median' and simply takes always the second ordered measurement vote out of 4 votes or is there a bug in the spec or implementation?
There's one misplaced throwaway sentence in dir-spec.txt:
" All ties in computing medians are broken in favor of the smaller or earlier item. "
We should bring this, and probably other things, into a "definitions" section earlier in dir-spec.txt. Patches welcome. ;)
-- Nick _______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

I think you are confusing the median with the mean: https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Mean Taking the median instead of the mean can be beneficial in situations where you have larger outliers in your data, which typically affect the mean very much. -j Virgil Griffith:
Is there some implementation-specific reason not to use the standard mathematical definition of "median"? If not, I propose changing the implementation to become it.
-V
On Tue, Aug 11, 2015 at 2:44 AM Nick Mathewson <nickm@alum.mit.edu> wrote:
On Mon, Aug 10, 2015 at 1:11 PM, nusenu <nusenu@openmailbox.org> wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512
Hi,
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n2028
If 3 or more authorities provide a Measured= keyword for a router, the authorities produce a consensus containing a "w" Bandwidth= keyword equal to the median of the Measured= votes.
a random sample from recent votes:
grep 37.59.38.117 -A 3 *|grep Measured w Bandwidth=6869 Measured=7570 w Bandwidth=6869 Measured=15500 w Bandwidth=6869 Measured=18100 w Bandwidth=6869 Measured=30500
Tor says the median value is 15500
2015-08-10-16-00-00-consensus: w Bandwidth=15500
but the median of these 4 values is actually: (18100+15500)/2 = 16800 no?
Has tor a different definition of 'median' and simply takes always the second ordered measurement vote out of 4 votes or is there a bug in the spec or implementation?
There's one misplaced throwaway sentence in dir-spec.txt:
" All ties in computing medians are broken in favor of the smaller or earlier item. "
We should bring this, and probably other things, into a "definitions" section earlier in dir-spec.txt. Patches welcome. ;)
-- Nick _______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
_______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

I mean the median.
From Wikipedia...
For example, if *a* < *b* < *c*, then the median of the list {*a*, *b*, *c*} is *b*, and, if *a* < *b* < *c* < *d*, then the median of the list {*a*, *b* , *c*, *d*} is the mean of *b* and *c*; i.e., it is (*b* + *c*) / 2. -V On Tue, Aug 11, 2015 at 9:29 PM John <oneofthem@riseup.net> wrote:
I think you are confusing the median with the mean:
https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Mean
Taking the median instead of the mean can be beneficial in situations where you have larger outliers in your data, which typically affect the mean very much.
-j
Virgil Griffith:
Is there some implementation-specific reason not to use the standard mathematical definition of "median"? If not, I propose changing the implementation to become it.
-V
On Tue, Aug 11, 2015 at 2:44 AM Nick Mathewson <nickm@alum.mit.edu> wrote:
On Mon, Aug 10, 2015 at 1:11 PM, nusenu <nusenu@openmailbox.org> wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512
Hi,
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n2028
If 3 or more authorities provide a Measured= keyword for a router, the authorities produce a consensus containing a "w" Bandwidth= keyword equal to the median of the Measured= votes.
a random sample from recent votes:
grep 37.59.38.117 -A 3 *|grep Measured w Bandwidth=6869 Measured=7570 w Bandwidth=6869 Measured=15500 w Bandwidth=6869 Measured=18100 w Bandwidth=6869 Measured=30500
Tor says the median value is 15500
2015-08-10-16-00-00-consensus: w Bandwidth=15500
but the median of these 4 values is actually: (18100+15500)/2 = 16800 no?
Has tor a different definition of 'median' and simply takes always the second ordered measurement vote out of 4 votes or is there a bug in the spec or implementation?
There's one misplaced throwaway sentence in dir-spec.txt:
" All ties in computing medians are broken in favor of the smaller or earlier item. "
We should bring this, and probably other things, into a "definitions" section earlier in dir-spec.txt. Patches welcome. ;)
-- Nick _______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
_______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
_______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

Virgil's absolutely right. Median as the "middle" value in a _sorted_ set is: - for odd number of data points, it's the middle one: set[N/2] - for even number of data points, it's the average of two in the middle: (set[N/2] + set[(N+1)/2]) / 2 Best regards, Maciej On Tue, Aug 11, 2015 at 3:44 PM, Virgil Griffith <i@virgil.gr> wrote:
I mean the median.
From Wikipedia...
For example, if *a* < *b* < *c*, then the median of the list {*a*, *b*, *c*} is *b*, and, if *a* < *b* < *c* < *d*, then the median of the list { *a*, *b*, *c*, *d*} is the mean of *b* and *c*; i.e., it is (*b* + *c*) / 2.
-V
On Tue, Aug 11, 2015 at 9:29 PM John <oneofthem@riseup.net> wrote:
I think you are confusing the median with the mean:
https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Mean
Taking the median instead of the mean can be beneficial in situations where you have larger outliers in your data, which typically affect the mean very much.
-j
Virgil Griffith:
Is there some implementation-specific reason not to use the standard mathematical definition of "median"? If not, I propose changing the implementation to become it.
-V
On Tue, Aug 11, 2015 at 2:44 AM Nick Mathewson <nickm@alum.mit.edu> wrote:
On Mon, Aug 10, 2015 at 1:11 PM, nusenu <nusenu@openmailbox.org> wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512
Hi,
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n2028
If 3 or more authorities provide a Measured= keyword for a router, the authorities produce a consensus containing a "w" Bandwidth= keyword equal to the median of the Measured= votes.
a random sample from recent votes:
grep 37.59.38.117 -A 3 *|grep Measured w Bandwidth=6869 Measured=7570 w Bandwidth=6869 Measured=15500 w Bandwidth=6869 Measured=18100 w Bandwidth=6869 Measured=30500
Tor says the median value is 15500
2015-08-10-16-00-00-consensus: w Bandwidth=15500
but the median of these 4 values is actually: (18100+15500)/2 = 16800 no?
Has tor a different definition of 'median' and simply takes always the second ordered measurement vote out of 4 votes or is there a bug in the spec or implementation?
There's one misplaced throwaway sentence in dir-spec.txt:
" All ties in computing medians are broken in favor of the smaller or earlier item. "
We should bring this, and probably other things, into a "definitions" section earlier in dir-spec.txt. Patches welcome. ;)
-- Nick _______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
_______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
_______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
_______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

On Tue, 11 Aug 2015 13:44:48 +0000, Virgil Griffith wrote:
I mean the median.
From Wikipedia...
For example, if *a* < *b* < *c*, then the median of the list {*a*, *b*, *c*} is *b*, and, if *a* < *b* < *c* < *d*, then the median of the list {*a*, *b* , *c*, *d*} is the mean of *b* and *c*; i.e., it is (*b* + *c*) / 2.
In the preceding paragraph wikipedia isn't that strict: 'the median is then usually defined to be the mean of the two middle values', and tor is unusual. :-) Andreas -- "Totally trivial. Famous last words." From: Linus Torvalds <torvalds@*.org> Date: Fri, 22 Jan 2010 07:29:21 -0800

I looked into this. Apparently Tor often uses the "low median", in cases where it needs to be a middle value, but an inbetween value is not allowed. This is chiefly for voting. On Tue, Aug 11, 2015 at 10:49 PM Andreas Krey <a.krey@gmx.de> wrote:
On Tue, 11 Aug 2015 13:44:48 +0000, Virgil Griffith wrote:
I mean the median.
From Wikipedia...
For example, if *a* < *b* < *c*, then the median of the list {*a*, *b*, *c*} is *b*, and, if *a* < *b* < *c* < *d*, then the median of the list {*a*, *b* , *c*, *d*} is the mean of *b* and *c*; i.e., it is (*b* + *c*) / 2.
In the preceding paragraph wikipedia isn't that strict: 'the median is then usually defined to be the mean of the two middle values', and tor is unusual. :-)
Andreas
-- "Totally trivial. Famous last words." From: Linus Torvalds <torvalds@*.org> Date: Fri, 22 Jan 2010 07:29:21 -0800 _______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

from today's measurement meeting:
15:00:20 <virgil> karsten: I've decided I'm going to fix the definition of median 15:00:26 <virgil> in the tor sourcecode 15:00:36 <karsten> virgil: is it broken? 15:00:53 <karsten> or just not specified as clearly as it should be? 15:01:01 <virgil> for ordered list {a,b,c,d}, it returns b instead of (b+c)/2. 15:01:24 <karsten> yes. maybe that's for a reason (which I don't know). 15:01:40 <virgil> I look forward to hearing this reason when my patch is rejected. 15:01:41 <karsten> like, using value (b+c)/2 would break for some reason, whereas any of a, b, c, d would be fine. 15:01:45 <Sebastian> you cannot do that 15:01:51 <Sebastian> without breaking Tor's voting 15:02:21 <Sebastian> Tor's specification requires low median for a bunch of directory stuff
I'd be interested in the reason as well.

On Wed, Aug 12, 2015 at 5:34 PM, nusenu <nusenu@openmailbox.org> wrote:
from today's measurement meeting:
15:00:20 <virgil> karsten: I've decided I'm going to fix the definition of median 15:00:26 <virgil> in the tor sourcecode 15:00:36 <karsten> virgil: is it broken? 15:00:53 <karsten> or just not specified as clearly as it should be? 15:01:01 <virgil> for ordered list {a,b,c,d}, it returns b instead of (b+c)/2. 15:01:24 <karsten> yes. maybe that's for a reason (which I don't know). 15:01:40 <virgil> I look forward to hearing this reason when my patch is rejected. 15:01:41 <karsten> like, using value (b+c)/2 would break for some reason, whereas any of a, b, c, d would be fine. 15:01:45 <Sebastian> you cannot do that 15:01:51 <Sebastian> without breaking Tor's voting 15:02:21 <Sebastian> Tor's specification requires low median for a bunch of directory stuff
I'd be interested in the reason as well.
The correct fix here is to update the code documentation to define the functions as returning the low-median, and to update dir-spec.txt to say so too. I'd accept documentation patches like that. Changing the code to return the mean of the two center elements from even arrays would break all authority voting, and wouldn't actually be useful. -- Nick

On 14 Aug 2015, at 03:10 , nusenu <nusenu@openmailbox.org> wrote:
Changing the code to return the mean of the two center elements from even arrays would break all authority voting, and wouldn't actually be useful.
Yes, that is what Sebastian said on IRC as well. Can you shed some light as to why it would break voting?
If the authorities supply different values in the consensus, voting breaks. Authorities using the low-median would supply one value, and authorities using the mean-median would supply another value. (Authorities typically run different versions of tor, and don't upgrade all at once.) Breaking changes like this are typically negotiated among the authorities using numbered consensus methods. Once enough authorities support a new consensus method, it is activated during voting. Rather than creating a new consensus method to implement mean-median, it's much easier to patch the documentation to specify low-median. (And I see no significant gain in changing from low-median to mean-median.) I'd rather see bandwidth measurements become more accurate, for more relays, more of the time, than change how their median is defined. Tim Tim Wilson-Brown (teor) teor2345 at gmail dot com pgp ABFED1AC https://gist.github.com/teor2345/d033b8ce0a99adbc89c5 teor at blah dot im OTR D5BE4EC2 255D7585 F3874930 DB130265 7C9EBBC7

On 13 Aug 2015, at 18:50, Nick Mathewson <nickm@alum.mit.edu> wrote:
On Wed, Aug 12, 2015 at 5:34 PM, nusenu <nusenu@openmailbox.org> wrote:
from today's measurement meeting:
15:00:20 <virgil> karsten: I've decided I'm going to fix the definition of median 15:00:26 <virgil> in the tor sourcecode 15:00:36 <karsten> virgil: is it broken? 15:00:53 <karsten> or just not specified as clearly as it should be? 15:01:01 <virgil> for ordered list {a,b,c,d}, it returns b instead of (b+c)/2. 15:01:24 <karsten> yes. maybe that's for a reason (which I don't know). 15:01:40 <virgil> I look forward to hearing this reason when my patch is rejected. 15:01:41 <karsten> like, using value (b+c)/2 would break for some reason, whereas any of a, b, c, d would be fine. 15:01:45 <Sebastian> you cannot do that 15:01:51 <Sebastian> without breaking Tor's voting 15:02:21 <Sebastian> Tor's specification requires low median for a bunch of directory stuff
I'd be interested in the reason as well.
The correct fix here is to update the code documentation to define the functions as returning the low-median, and to update dir-spec.txt to say so too. I'd accept documentation patches like that.
The documentation for the code already says that. The spec could be updated to say low-median consistently, tho. Cheers Sebastian
participants (8)
-
Andreas Krey
-
John
-
Maciej Soltysiak
-
Nick Mathewson
-
nusenu
-
Sebastian Hahn
-
teor
-
Virgil Griffith