Hit a repeat of an earlier incident:
https://lists.torproject.org/pipermail/tor-relays/2016-January/008621.html
The message from the tor daemon is
Resolved [scrubbed] which was already resolved; ignoring
About 5400 of these messages appeared over 37 hours, during which the relay dropped to 30% of its usual load and did not work for Tor Browser when the exit was explicitly selected. Unlike in the earlier incident, DNSSEC was disabled. Probably the built-in IP address for root server H was stale, since it changed in December and this 'unbound' was built in September; just installed and configured the current 'named.root' hints file. Both the "iterator" and "validator" modules were loaded even though validation was disabled, so changed the config to load only "iterator".
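For reference, the hints file can be fetched from InterNIC (https://www.internic.net/domain/named.root) and the resulting unbound.conf lines are roughly as follows; the path is only an example:

server:
    # point unbound at a freshly downloaded copy of the root hints
    root-hints: "/etc/unbound/named.root"
    # load only the iterator module; validation is disabled anyway
    module-config: "iterator"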
Unbound 1.5.4, tor 0.2.6.10. Running 164 days.
If I see this again I'm planning to attempt purging the 'unbound' request queue with
unbound-control dump_requestlist >reqlist.snap
unbound-control flush_requestlist
and if that doesn't work, bouncing 'unbound' without stopping the 'tor' daemon.
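In the meantime, a cheap way to watch for the queue building up again is something along the lines of

unbound-control stats_noreset | grep requestlist

which prints the per-thread requestlist max/avg counters without zeroing the periodic statistics.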
Interesting stats were recorded, starting with the good pre-incident entries:
Mar 11 05:53
notice: sendto failed: Invalid argument
notice: remote address is 0.0.0.1 port 53
...repeated 4 times

Mar 15 17:45
server stats for thread: 1029423 queries, 415824 answers from cache, 613599 recursions, 0 prefetch
server stats for thread: requestlist max 132 avg 22.5692 exceeded 0 jostled 0
average recursion processing time 4.070100 sec

Mar 16 05:46
Resolved [scrubbed] which was already resolved; ignoring

Mar 16 17:45
server stats for thread: 1162172 queries, 287662 answers from cache, 874510 recursions, 0 prefetch
server stats for thread: requestlist max 421 avg 198.684 exceeded 0 jostled 0
average recursion processing time 25.833646 sec

Mar 16 21:08
notice: sendto failed: Invalid argument
notice: remote address is 0.0.0.1 port 53
...repeated 4 times

Mar 17 16:52
notice: sendto failed: Invalid argument
notice: remote address is 0.0.0.1 port 53
...repeated 4 times

Mar 17 17:45
server stats for thread: 1078496 queries, 144557 answers from cache, 933939 recursions, 0 prefetch
server stats for thread: requestlist max 459 avg 296.668 exceeded 0 jostled 0
average recursion processing time 40.621863 sec

service stopped (unbound 1.5.4).

Mar 17 19:28
server stats for thread: 53991 queries, 3455 answers from cache, 50536 recursions, 0 prefetch
server stats for thread: requestlist max 342 avg 290.977 exceeded 0 jostled 0
average recursion processing time 35.947563 sec
If anyone has seen this or knows anything, please comment. Tried searching but came up with nothing except the thread referenced above.
As with the earlier incident, problem came back within hours of restarting the daemons.
Was able to figure out what's happening. Operators running 'unbound' take note!
Problem appears to be the result of someone attempting to DDOS a DNS service, in this case GoDaddy.
Ran
lsof -Pn -p <unbnd_pid>
a few times and observed numerous SYN_SENT TCP connections, all of them to 208.109.255.0/24, where GoDaddy DNS servers are found. Appears GoDaddy is rate-limiting or blocking requests from the 'unbound' instance on the relay IP.
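A rough way to put a number on it at any moment is something like

lsof -Pn -p <unbnd_pid> | grep -c SYN_SENT

which just counts the half-open TCP connections.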
Ran
unbound-control dump_requestlist
and see a large queue of requests to GoDaddy. Finally ran
unbound-control dump_infra >infralst
and see 14000 lines similar to
208.109.255.26 cycsErvicioSsAS.coM. expired rto 120000
indicating a huge number of requests have been made to GoDaddy and have expired after 120 seconds.
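A quick way to count how many of the infra entries point at that range is something like

unbound-control dump_infra | grep -c '^208\.109\.255\.'

assuming all the GoDaddy servers involved sit in that /24.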
Presently the quantity of requests has fallen off and the exit is operating fine. Have set an alarm on the tell-tale log message. When it recurs I expect
unbound-control flush_requestlist
will mitigate the problem. Presently looking into configuring the 'ratelimit' feature of 'unbound'. If anyone has already done this successfully please post to this thread.
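For anyone who wants to experiment, the relevant unbound.conf options appear to be 'ratelimit', 'ratelimit-size' and 'ratelimit-factor'; a starting point might look like the following, with the numbers being untested guesses rather than recommendations:

server:
    # max queries per second sent toward the nameservers of any one zone
    ratelimit: 1000
    # memory reserved for tracking the per-zone rates
    ratelimit-size: 4m
    # when over the limit, let 1 in 10 queries through instead of dropping all
    ratelimit-factor: 10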
Possibly this incident is the result of some malware attempting to use some sort of domain "fast flux" or DGA algorithm. Seems improbable anyone would be dumb enough to try to DDOS GoDaddy DNS using Tor.
Problem came back again while I was working on the exit.
unbound-control flush_requestlist
does not help but it appears that
unbound-control flush_infra all
unbound-control flush_requestlist
will clear up the problem without requiring a daemon restart--at least temporarily.
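If it keeps recurring, a crude watchdog along these lines could run the flush automatically (untested sketch; the log path, log level and match string are specific to this setup):

#!/bin/sh
# run from cron; flush unbound's in-flight state when tor starts logging
# the tell-tale "already resolved; ignoring" message
if tail -n 200 /var/log/tor/notices.log | grep -q 'already resolved; ignoring'; then
    unbound-control flush_infra all
    unbound-control flush_requestlist
fi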
Also tried setting
do-tcp: off
but this did not appear to make a difference.
Seems to me a degenerate interaction between tor's 'eventdns' subsystem and 'unbound' comes into play when this DNS flood/attack occurs. Have an 'info' level log captured with SafeLogging=0, covering a few minutes while the relay was in the bogged-down state and failing to service Tor Browser requests. If a developer is interested in taking a look at it, please contact me directly.
This issue is a PIA and if it continues I'll give up on 'unbound' and follow the previous operator, switching to bind9 despite the lesser performance.
On 03/18/2016 02:02 PM, Dhalgren Tor wrote:
> This issue is a PIA and if it continues I'll give up on 'unbound' and follow the previous operator, switching to bind9 despite the lesser performance.
Could you try switching to a different upstream DNS provider?
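In unbound terms that would mean forwarding instead of full iteration, roughly as below (the resolver addresses are documentation placeholders only, not a provider recommendation):

forward-zone:
    name: "."
    forward-addr: 198.51.100.1
    forward-addr: 198.51.100.2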
Gave up, switched to 'named', and now it's working fine.
Entered BUG: https://trac.torproject.org/projects/tor/ticket/18580
Be advised, anyone running a fast exit with 'unbound' should switch to using 'named'.
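For whoever follows, a minimal local-only recursive setup for 'named' looks roughly like this (untested sketch; the directory path depends on the distribution):

options {
    directory "/var/cache/bind";
    // local recursive resolver for the tor daemon only
    recursion yes;
    listen-on { 127.0.0.1; };
    allow-query { localhost; };
    // matches the unbound setup above, which had validation off
    dnssec-validation no;
};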