On Tue, Jul 07, 2020 at 01:01:12AM +0200, nusenu wrote:
https://gitlab.torproject.org/tpo/metrics/relay-search/-/issues/40001
thanks, I'll reply here since I (and probably others) cannot reply there.
Fwiw, anybody who wants a gitlab account should just ask for one. Don't be shy. :)
The instructions for asking are here: https://gitlab.torproject.org/users/sign_in
(A) Limiting each "unverified" relay family to 0.5% doesn't by itself limit the total fraction of the network that's unverified. I see a lot of merit in another option, where the total (global, network-wide) influence from relays we don't "know" is limited to some fraction, like 50% or 25%.
I like it (it is even stricter than what I proposed). You are basically saying the "known" pool should always control a fixed (or minimum?) portion - let's say 75% - of the entire network, no matter what capacity the "unknown" pool has.
Right.
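To make that concrete, here is a minimal sketch (Python, with made-up names and an illustrative 0.75 parameter) of the arithmetic a directory-authority-side component could use: proportionally down-weight the "unknown" pool whenever it would push the "known" pool below its minimum share. It deliberately ignores the real bandwidth-weight machinery; it is only the scaling step.

    # Sketch: cap the total consensus weight of "unknown" relays so the
    # "known" pool keeps at least a minimum fraction of the network.
    # Names and the 0.75 default are illustrative, not a real dirauth algorithm.

    def rescale_weights(relays, min_known_fraction=0.75):
        """relays: list of dicts with 'fingerprint', 'weight', 'known' (bool).
        Returns a fingerprint -> weight mapping where unknown relays are
        proportionally down-weighted if they exceed their allowed share."""
        known_total = sum(r['weight'] for r in relays if r['known'])
        unknown_total = sum(r['weight'] for r in relays if not r['known'])

        # If known relays must hold at least min_known_fraction of the network,
        # the unknown pool may hold at most this much weight:
        allowed_unknown = known_total * (1 - min_known_fraction) / min_known_fraction

        scale = 1.0
        if unknown_total > allowed_unknown and unknown_total > 0:
            scale = allowed_unknown / unknown_total

        return {r['fingerprint']: r['weight'] * (1.0 if r['known'] else scale)
                for r in relays}

After the rescaling, the unknown pool carries exactly its allowed share in the worst case, and is untouched when it is already under the cap.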
but it doesn't address the key question: How do you specifically define "known" and how do you verify entities before you move them to the "known" pool?
Well, the first answer is that these are two separate mechanisms, which we can consider almost independently:
* One is dividing the network into known and unknown relays, where we reserve some minimum fraction of attention for the known relays. Here the next steps are to figure out how to do load balancing properly with this new parameter (mainly a math problem), and to sort out the logistics for how to label the known relays so directory authorities can assign weights properly (mainly coding / operator ux).
* Two is the process we use for deciding if a relay counts as known. My suggested first version here is that we put together a small team of Tor core contributors to pool their knowledge about which relay operators we've met in person or otherwise have a known social relationship with.
One nice property of "do we know you" over "do you respond to mail at a physical address" is that the thing you're proving matters into the future too. We meet people at relay operator meetups at CCC and Fosdem and Tor dev meetings, and many of them are connected to their own local hacker scenes or other local communities. Or said another way, burning your "was I able to answer a letter at this fake address" effort is a different tradeoff than burning your "was I able to convince a bunch of people in my local and/or international communities that I mean well" effort.
I am thinking back to various informal meetings over the years at C-base, Hacking At Random, Defcon, etc. The "social connectivity" bond is definitely not perfect, but I think it is the best tool available to us, and it provides some better robustness properties compared to more faceless "proof of effort" approaches.
That said, on the surface it sure seems to limit the diversity we can get in the network: people we haven't met in Russia or Mongolia or wherever can still (eventually, postal service issues aside) answer a postal letter, whereas it is harder for them to attend a CCC meetup. But I think the answer there is that we do have a pretty good social fabric around the world, e.g. with connections to OTF fellows, the communities that OONI has been building, etc, so for many places around the world, we can ask people we know there for input.
And it is valuable for other reasons to build and strengthen these community connections -- so the incentives align.
Here the next step is to figure out the workflow for annotating relays. I had originally imagined some sort of web-based UI that leads me through constructing and maintaining a list of fingerprints that I have annotated as 'known' and a list annotated as 'unknown', shows me how my lists have been doing over time, and presents me with new not-yet-annotated relays.
But maybe a set of scripts, that I run locally, is almost as good and much simpler to put together. Especially since, at least at first, we are talking about a system that has on the order of ten users.
One of the central functions in those scripts would be to sort the annotated relays by network impact (some function of consensus weight, bandwidth carried, time in network, etc), so it's easy to identify the not-yet-annotated ones that will mean the biggest shifts. Maybe this ordered list is something we can teach onionoo to output, and then all the local scripts need to do is go through each relay in the onionoo list, look them up in the local annotations list to see if they're already annotated, and present the user with the unannotated ones.
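As a rough illustration of that script, here is a sketch that fetches Onionoo details documents, ranks relays by a made-up impact score, and prints the highest-impact relays that are not yet in the local annotation files. The impact formula, the file layout, and the exact field handling are assumptions for illustration, not a finished design.

    # Sketch of the local annotation helper described above: pull relay data,
    # rank by a rough "network impact" score, and show relays not yet in the
    # local 'known'/'unknown' lists. The impact formula is illustrative.
    import json
    import urllib.request

    ONIONOO_URL = "https://onionoo.torproject.org/details?type=relay&running=true"

    def load_annotations(path):
        """One fingerprint per line; returns a set. Missing file -> empty set."""
        try:
            with open(path) as f:
                return {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            return set()

    def impact(relay):
        # Rough impact score: mostly consensus weight, plus a small bonus
        # for advertised bandwidth. Entirely illustrative.
        return (relay.get("consensus_weight_fraction", 0.0)
                + 0.1 * relay.get("advertised_bandwidth", 0) / 1e9)

    def unannotated_by_impact(known_path="known.txt", unknown_path="unknown.txt"):
        annotated = load_annotations(known_path) | load_annotations(unknown_path)
        with urllib.request.urlopen(ONIONOO_URL) as resp:
            relays = json.load(resp)["relays"]
        todo = [r for r in relays if r["fingerprint"] not in annotated]
        return sorted(todo, key=impact, reverse=True)

    if __name__ == "__main__":
        for relay in unannotated_by_impact()[:20]:
            print(relay["fingerprint"], relay.get("nickname", ""))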
To avoid centralizing too far, I could imagine some process that gathers the current annotations from the several people who are maintaining them, and aggregates them somehow. The simplest version of aggregation is "any relay that anybody in the group knows counts as known", but we could imagine more complex algorithms too.
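For the simplest version, the aggregation is just a set union over each maintainer's published list of known fingerprints; a sketch, assuming a hypothetical one-fingerprint-per-line file format:

    # Minimal sketch of the "any relay that anybody in the group knows counts
    # as known" rule: a union over each maintainer's list of fingerprints.
    # The file format is hypothetical.

    def aggregate_known(list_paths):
        """list_paths: iterable of file paths, one fingerprint per line."""
        known = set()
        for path in list_paths:
            with open(path) as f:
                known |= {line.strip() for line in f if line.strip()}
        return known

    # Example: known = aggregate_known(["maintainer1.txt", "maintainer2.txt"])

A stricter variant could require some threshold of maintainers to vouch for a relay before it counts, but that is one of the "more complex algorithms" to figure out later.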
And lastly, above I said we can consider the two mechanisms "almost independently" -- the big overlap point is that we need to better understand what fraction of the network we are considering "known", and make sure to not screw up the load balancing / performance of the network too much.
(2) We need to get rid of http and other unauthenticated internet protocols:
This is something browser vendors will tackle for us I hope, but it will not be anytime soon.
Well, we could potentially tackle it sooner than the mainstream browser vendors. See https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/19850#no... where maybe (I'm not sure, but maybe) https-everywhere has a lot of the development work already done.
--Roger