Hello!
There has been some confusion among relay operators about how we deal with bad relays, who actually makes the decisions, and how the overall process works. Even though we don't yet have a document to point to that answers all those questions (more on that below), we thought it could be useful to give the relay community a status update and outline possible next steps.
One of the tasks in the network health area is to make sure bad relays are found and excluded from the network. This should happen according to transparent criteria which help relay operators understand both expectations and processes. Ideally, a document containing those criteria would also give operators some insight into how we arrived at them.
Unfortunately, as I said above, we have not written that document yet. However, that does not mean that removing relays from the network is currently arbitrary. Rather, we have some rules of thumb and some unwritten guidelines which seem worth sharing at this point to help relay operators better understand what is going on in the bad relay detection world.
A bad relay is one that either doesn't work properly or tampers with users' connections. This can be the result of either maliciousness or misconfiguration. To find those relays we rely on scanners that check for common issues and on volunteers who spot things beyond what our scanners target.
To give you some examples of issues we are concerned about:
a) Tampering with exit traffic
b) Running HSDirs that harvest and probe .onion addresses
c) Issues with resolving DNS queries on exit relays
d) Flooding the network with relays to deanonymize users
e) Running outdated Tor versions
...
Now, how do we detect maliciousness vs. misconfiguration and what do we do about both?
There is behavior that we think is clearly malicious, like tampering with exit traffic or trying to harvest and probe .onion addresses. In those cases we reject relays outright. In the past we thought relays that tampered with exit traffic could still be useful as non-exit relays, so they got the BadExit flag. But it turned out that a bunch of those showed other, more subtle misbehavior, and thus we decided to be on the safe side and simply reject those malicious relays nowadays.
For behavior that could be either malicious or the result of a misconfiguration (like missing MyFamily settings), things get messier.
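For those less familiar with the MyFamily case: two relays only count as being in the same family when each lists the other, so a relay that is missing the setting leaves one-sided declarations behind. Below is a minimal Python sketch of how one might spot those. It is not what our scanners run; the function name, the input format and the shortened fingerprints are made up for illustration.

```python
def one_sided_family_claims(families):
    """Return sorted (claimer, claimed) pairs where the claim is not mutual.

    `families` maps a relay fingerprint to the set of fingerprints that
    relay lists in its MyFamily setting (hypothetical input format).
    """
    missing = []
    for relay, claimed in families.items():
        for other in claimed:
            if other == relay:
                continue  # a relay listing itself is harmless, skip it
            # The claim only takes effect if `other` lists `relay` back.
            if relay not in families.get(other, set()):
                missing.append((relay, other))
    return sorted(missing)


# Made-up, shortened fingerprints for illustration only.
families = {
    "AAAA": {"BBBB", "CCCC"},
    "BBBB": {"AAAA"},
    "CCCC": set(),  # MyFamily was forgotten on this relay
}

print(one_sided_family_claims(families))
```

Here the sketch reports that AAAA lists CCCC without CCCC listing AAAA back, which is the typical pattern when a newly added relay has not been given its MyFamily line yet.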
Means for contacting relay operators (e.g. a meaningful ContactInfo entry) are very important in cases where misconfiguration can play a role. We usually contact operators in that case (if possible) to figure out what is going on and help them get their configurations right. That means there is no outright forced removal of relays that, for example, did not have their MyFamily configuration set up properly (we know it can be tricky). That approach is successful in a lot of cases and helps us build a relationship with operators, which is worthwhile as well. However, in cases where we don't get a reaction or become confident that the intentions of the operator are malicious, we'll reject the relay(s) to protect our users.
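To make the rules of thumb above a bit more concrete, here is a rough Python sketch of the decision flow as described in this mail. It is only an illustration and not actual tooling or a written policy; the names (Report, triage, the issue labels) are invented, and how long to wait before counting something as "no reaction" is deliberately left to the caller.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REJECT = "reject"
    CONTACT_OPERATOR = "contact operator"


# Behavior the mail calls clearly malicious (labels are hypothetical).
CLEARLY_MALICIOUS = {"exit_tampering", "hsdir_harvesting"}


@dataclass
class Report:
    issue: str
    contacted: bool = False            # did we already reach out?
    operator_responded: bool = False   # did we get a reaction?
    confident_malicious: bool = False  # judgment call by a human


def triage(report):
    """Map a report to an action, following the rules of thumb above."""
    if report.issue in CLEARLY_MALICIOUS:
        return Action.REJECT  # rejected outright, no BadExit fallback
    if report.confident_malicious:
        return Action.REJECT
    if report.contacted and not report.operator_responded:
        return Action.REJECT  # no reaction from the operator
    # Possible misconfiguration: reach out, or keep helping the operator.
    return Action.CONTACT_OPERATOR


print(triage(Report("exit_tampering")).value)
print(triage(Report("missing_myfamily")).value)
print(triage(Report("missing_myfamily", contacted=True)).value)
```

The ordering of the checks reflects that clearly malicious behavior is never a candidate for the contact-first path, while everything else starts with reaching out to the operator.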
All those activities mentioned above are coordinated on the bad-relays list, which is private and used by members of the team to discuss cases and keep each other in the loop.
As to next steps: yes, we need to sit down and finish that document, listing all the criteria we are concerned with and giving some rationale for each of them. Alas, there is no timeframe for getting this work done. But once we are there, we'll consult tor-internal and the tor-relays list for input and make changes as needed.
I hope this helps to clear some things up. I am happy to answer questions and reply to concerns, on- or off-list, should there be any.
Georg