Exit Relay DNS Health Module for Exitmap - Seeking Guidance Before Deployment
Hi Network Health Team,
A few large-scale exit relay operators have asked for better visibility into DNS health across their relays. We've built an exitmap module, dnshealth, to address this and want your input before we start running scans and publishing results.
What It Does
- Generates unique DNS queries per relay (wildcard subdomain → expected IP) to avoid caches
- Classifies failures: timeout, NXDOMAIN, wrong IP, SOCKS errors
- Outputs structured JSON with latency and error details
All code is open source: https://github.com/1aeo/exitmap
Initial testing: ~98% success rate across ~3k exits, 50-90 true failures per scan, 4-8 min runtime.
Before We Proceed
1. Any concerns with us running regular scans and publishing results?
2. Recommendations on scan frequency or methodology?
Happy to adjust our approach based on your guidance.
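As a rough illustration of the approach listed under "What It Does", the sketch below builds a cache-busting hostname per relay and per run, maps one probe outcome onto the failure classes named above, and serializes the result as JSON. It is not the dnshealth module's code: the zone dnshealth.example, the expected address 192.0.2.1, the label order, the error strings, and the JSON field names are all assumptions made for the example.

```python
import json
import time

# Assumed values for illustration only (not taken from the dnshealth module).
BASE_DOMAIN = "dnshealth.example"  # hypothetical wildcard zone
EXPECTED_IP = "192.0.2.1"          # hypothetical wildcard answer (TEST-NET-1)


def unique_query_name(fingerprint, now=None):
    """Build a hostname that is unique per relay and per run, so caches cannot answer it."""
    ts = int(time.time() if now is None else now)
    return f"{ts}.{fingerprint.lower()}.{BASE_DOMAIN}"


def classify(resolved_ip, error=None):
    """Map one probe outcome onto the failure classes from the list above."""
    if error == "timeout":
        return "timeout"
    if error == "nxdomain":
        return "nxdomain"
    if error is not None:
        return "socks_error"  # any other SOCKS-level failure
    if resolved_ip != EXPECTED_IP:
        return "wrong_ip"
    return "success"


def to_record(fingerprint, status, latency_ms, error=None):
    """Serialize one result as JSON; the field names are hypothetical."""
    return json.dumps(
        {"fingerprint": fingerprint, "status": status,
         "latency_ms": latency_ms, "error": error},
        sort_keys=True,
    )
```

Because the timestamp and fingerprint are part of the name, a correct answer can only come from a fresh lookup against the wildcard zone, which is the property the first bullet relies on.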
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hello.
Tor at 1AEO wrote:
- Classifies failures: timeout, NXDOMAIN, wrong IP, SOCKS errors
Would this be capable of detecting whether or not nameservers block the entire /64 that an IPv6 exit is on? I've always been concerned about enabling IPv6 in Unbound with the fear that, even if the IP differs, it would still be blocked due to being in the same /64.
For IPv4, the solution is expensive but easy: Buy a second IP. For IPv6, it would be really nice if I didn't have to buy a second /64.
Regards, forest
-----BEGIN PGP SIGNATURE-----
iHUEARYKAB0WIQQtr8ZXhq/o01Qf/pow+TRLM+X4xgUCaWwfHQAKCRAw+TRLM+X4
xkSdAP4y8i5fh7crQ4FCB79/HhsAKmMb7KhMI+4roKhMrSMdPgEAjyBTLKPscuNd
CzMEfhsSujDdPs7MZbjxfh0I1rHqKg4=
=Lkoh
-----END PGP SIGNATURE-----
Hi! Tor at 1AEO via network-health:
Hi Network Health Team,
A few large-scale exit relay operators have asked for better visibility into DNS health across their relays. We've built an exitmap module, dnshealth, to address this and want your input before we start running scans and publishing results.
Very nice work, thanks!
What It Does
- Generates unique DNS queries per relay (wildcard subdomain → expected IP) to avoid caches
- Classifies failures: timeout, NXDOMAIN, wrong IP, SOCKS errors
- Outputs structured JSON with latency and error details
All code is open source: https://github.com/1aeo/exitmap
Initial testing: ~98% success rate across ~3k exits, 50-90 true failures per scan, 4-8 min runtime.
You mean true DNS resolution failures per scan or general ones (including DNS)?
Before We Proceed
1. Any concerns with us running regular scans and publishing results?
It depends on the regularity. What did you have in mind in that regard? We are currently running the dnsresolution module on a weekly basis and informing relay operators in case of trouble and that has been sufficient frequency-wise I think. I don't think there are any publishing concerns, no. Where are the results supposed to show up?
2. Recommendations on scan frequency or methodology?
Yes. I think weekly should be fine at least for a start. As for the methodology, it would be very much appreciated if you could upstream your changes to exitmap itself where it makes sense so we don't start creating duplicated infrastructure. Ideally, there would be only one dnsresolution module and not a myriad of different ones.
Happy to adjust our approach based on your guidance.
I tried to provide some, let me know if you had something else in mind or I forgot to address anything. Thanks, Georg
_______________________________________________ network-health mailing list -- network-health@lists.torproject.org To unsubscribe send an email to network-health-leave@lists.torproject.org
This isn’t currently deployed, but by self-hosting the authoritative nameserver (rather than using Cloudflare), we could observe per-attempt wildcard DNS queries (e.g. <timestamp>.<fingerprint>.tor.exit.validator.1aeo.com) and correlate whether those queries reach the authoritative server when initiated by exit relays.
When comparing exits within the same IPv6 /64, we would expect to see one of three outcomes:
1) queries arrive directly from the exit IP (recursive resolution)
2) queries arrive via a shared upstream resolver
3) no queries arrive at all, which would be consistent with resolver-side or upstream /64-level blocking
Implementing this requires some additional work, so it would be useful to understand how commonly exit operators encounter suspected /64-level DNS blocking in practice.
On Saturday, January 17th, 2026 at 3:46 PM, forest-relay-contact--- via network-health <network-health@lists.torproject.org> wrote:
Hello.
Tor at 1AEO wrote:
- Classifies failures: timeout, NXDOMAIN, wrong IP, SOCKS errors
Would this be capable of detecting whether or not nameservers block the entire /64 that an IPv6 exit is on? I've always been concerned about enabling IPv6 in Unbound with the fear that, even if the IP differs, it would still be blocked due to being in the same /64.
For IPv4, the solution is expensive but easy: Buy a second IP. For IPv6, it would be really nice if I didn't have to buy a second /64.
Regards, forest
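The three outcomes described in the reply above (the one that precedes this quoted message) could be told apart roughly as follows once the authoritative server's query log is available. This is a minimal sketch under stated assumptions, not deployed code: the function names, the outcome labels, and the idea that the log yields a list of source addresses per unique hostname are all hypothetical. A missing query is only consistent with /64-level blocking; it does not prove it.

```python
import ipaddress


def same_slash64(addr_a, addr_b):
    """Return True if two IPv6 addresses fall inside the same /64 network."""
    net_a = ipaddress.ip_network(f"{addr_a}/64", strict=False)
    return ipaddress.ip_address(addr_b) in net_a


def classify_observation(exit_ip, source_ips):
    """Classify what the authoritative server logged for one exit's unique hostname.

    source_ips: source addresses of the queries seen for that hostname
    (hypothetical input, e.g. parsed from the nameserver's query log).
    """
    if not source_ips:
        # Outcome 3: consistent with resolver-side or upstream blocking.
        return "no_query_seen"
    seen = {ipaddress.ip_address(s) for s in source_ips}
    if ipaddress.ip_address(exit_ip) in seen:
        # Outcome 1: the exit's own address performed recursive resolution.
        return "direct_recursive"
    # Outcome 2: some other resolver asked on the exit's behalf.
    return "via_upstream_resolver"
```

Grouping exits with same_slash64 and then comparing their classify_observation results is one way to spot a /64 where one address resolves and its neighbours see no queries arrive.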
Appreciate the support and quick response!
"You mean true DNS resolution failures per scan or general ones (including DNS)?"
Both. Over the last few days, per scan (~3,000 exit relays in the consensus), we’ve observed:
- ~55 exit relays with true DNS resolution failures, where circuits established successfully but DNS resolution failed with the exit.
- ~415 exit relays with circuit or infrastructure failures, where we were unable to reach the relay.
Breaking that down further:
DNS failures (~55 total):
- ~33 NXDOMAIN responses (SOCKS4: domain not found)
- ~21 DNS query timeouts (45 seconds, 3 attempts)
- ~1 case where a wrong IP was returned
Circuit / infrastructure failures (~415 total):
- ~328 cases where the relay channel closed unexpectedly
- ~59 circuit construction timeouts
- ~29 circuits closed or destroyed
The remaining ~2,570 exit relays (~98%) successfully resolved a unique wildcard DNS query per timestamp and relay fingerprint.
"It depends on the regularity. What did you have in mind in that regard?"
Our current plan is to run scans every few hours to provide faster feedback when an operator is actively troubleshooting an issue, given that this is a single DNS query per relay. We’ll refine cadence as we get feedback and learn more.
"I don't think there are any publishing concerns, no. Where are the results supposed to show up?"
Results will be published on a small public site with a simple JSON API for programmatic access (tentatively https://exitdnshealth.1aeo.com/), and integrated into operator-facing views (per family / AROI and per relay) on https://metrics.1aeo.com/. Happy to adjust presentation if there are preferences.
"As for the methodology it would be very much appreciated if you could upstream your changes to exitmap itself…"
Absolutely. We strongly prefer not to maintain a long-term fork and are very open to submitting pull requests.
What’s the preferred workflow from your perspective: start with a PR directly, or discuss scope and structure first?
For reference, current code lives at:
https://github.com/1aeo/exitmap (working on a branch to make the commits easier to follow for sending upstream)
https://github.com/1aeo/exitmap-dns-health-deploy
New question, from seeing DNSSEC support in exitmap: is DNSSEC expected to be enabled for exit relays? Hearing mixed viewpoints from operators. Debating whether it should be added to this exit relay DNS health effort and leaning towards yes.
Sent in reply to the message from Georg Koppen via network-health <network-health@lists.torproject.org> of Monday, January 19th, 2026 at 3:14 AM.
Tor at 1AEO:
Appreciate the support and quick response!
You are welcome!
"You mean true DNS resolution failures per scan or general ones (including DNS)?" Both. Over the last few days, per scan (~3,000 exit relays in the consensus), we’ve observed:
~55 exit relays with true DNS resolution failures, where circuits established successfully but DNS resolution failed with the exit.
~415 exit relays with circuit or infrastructure failures, where we were unable to reach the relay.
Breaking that down further:
DNS failures (~55 total):
- ~33 NXDOMAIN responses (SOCKS4: domain not found)
- ~21 DNS query timeouts (45 seconds, 3 attempts)
- ~1 case where a wrong IP was returned
Circuit / infrastructure failures (~415 total):
- ~328 cases where the relay channel closed unexpectedly
- ~59 circuit construction timeouts
- ~29 circuits closed or destroyed
The remaining ~2,570 exit relays (~98%) successfully resolved a unique wildcard DNS query per timestamp and relay fingerprint.
Those results are interesting, thanks. The true DNS failures are pretty high compared to what we have been getting over the years when testing whether example.com and torproject.org are resolvable. Anything between 5 and 10 issues per week seems not unreasonable according to the data we have, but 55 is clearly an outlier worthy of some explanation, in particular as we got just 5 relays with issues last week during our weekly scan.
Can you share the fingerprints of the relays you found so I can have a closer look and check whether you might have a bunch of false positives in your results? Are the results for those relays stable if you scan over the course of a couple of days or are they fluctuating?
"It depends on the regularity. What did you have in mind in that regard?"
Our current plan is to run scans every few hours to provide faster feedback when an operator is actively troubleshooting an issue, given that this is a single DNS query per relay. We’ll refine cadence as we get feedback and learn more.
Hrm. exitmap can run modules for particular relays (provided on the command line either per fingerprint or file). So, when an operator is troubleshooting an issue it makes more sense to me to run something just with their fingerprints than to scan the network over and over again just so the operator can see whether something is working again for them. Or maybe I missed the point as to why you need to scan the whole network within a couple of hours' frequency for that use case? I think running the scan for the whole network at most once a day and then zooming closer into relay groups with issues is a strictly better deployment plan.
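The "scan once, then zoom in" plan suggested above could be supported by a small helper that turns the failing relays from a full scan into a fingerprint file, one per line, for a targeted rescan. This is only a sketch: the function name and file layout are assumptions, and the exact exitmap option that consumes such a file should be checked against exitmap's own help output.

```python
import re
from pathlib import Path

# A relay fingerprint is 40 hexadecimal characters.
FPR_RE = re.compile(r"^[0-9A-F]{40}$")


def write_rescan_list(failing_fingerprints, path):
    """Write de-duplicated, validated fingerprints to `path`, one per line.

    Returns the number of fingerprints written. Raises ValueError if any
    entry does not look like a relay fingerprint.
    """
    fprs = sorted({fp.strip().upper() for fp in failing_fingerprints})
    bad = [fp for fp in fprs if not FPR_RE.match(fp)]
    if bad:
        raise ValueError(f"not a relay fingerprint: {bad[0]!r}")
    Path(path).write_text("".join(f"{fp}\n" for fp in fprs))
    return len(fprs)
```

Rescanning only this list keeps the follow-up load proportional to the number of relays with issues rather than to the size of the network.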
"I don't think there are any publishing concerns, no. Where are the results supposed to show up?"
Results will be published on a small public site with a simple JSON API for programmatic access (tentatively https://exitdnshealth.1aeo.com/), and integrated into operator-facing views (per family / AROI and per relay) on https://metrics.1aeo.com/. Happy to adjust presentation if there are preferences.
Sounds good to me at least.
"As for the methodology it would be very much appreciated if you could upstream your changes to exitmap itself…"
Absolutely. We strongly prefer not to maintain a long-term fork and are very open to submitting pull requests.
What’s the preferred workflow from your perspective: start with a PR directly, or discuss scope and structure first?
For reference, current code lives at:
https://github.com/1aeo/exitmap (working on a branch to make the commits easier to follow for sending upstream)
https://github.com/1aeo/exitmap-dns-health-deploy
I think filing a ticket at https://gitlab.torproject.org/tpo/network-health/exitmap is a good start. I am happy to file child tickets for different parts of the work if needed (e.g. the structured output of the results), so no worries about that. If it's easiest for you to have one big MR referencing the parent ticket then just raise that one and we can have all the technical discussion there. If you want to scope the potential MR in a ticket discussion first then I am happy to do that as well. Or if there is yet another plan even more appealing to you, go for it. Up to you. :)
New question, from seeing DNSSEC support in exitmap, is DNSSEC expected to be enabled for exit relays? Hearing mixed viewpoints from operators. Debating whether it should be added to this exit relay DNS health effort and leaning towards yes.
No, it's not expected at this point. Sometimes we have modules to only gather information about what the network looks like at a particular point in time, without getting to an unhealthy/healthy discrimination. Thus, I'd say at least at this point, don't worry about the DNSSEC status either in exitmap scanning efforts or actual exit relay support.
Thanks, Georg
Apologies for the delay. Gathered a month of data to better answer your questions about stability of results. Responses inline and some charts attached.
Can you share the fingerprints of the relays you found so I can have a closer look and check whether you might have a bunch of false positives in your results?
Yes, the exit relays with DNS query errors are easily available with nickname, fingerprint, status, and detailed error: https://exitdnshealth.1aeo.com/
Example: ~47 relays were tested 4 times each and failed to return the correct IP address on all 4 wildcard DNS queries. Here are some examples:
Nickname       Fingerprint                               Status    Error
obzgs5tbmn4q   04F909DEA6F029CDCC717B726A12E03AE02C37F1  dns_fail  DNS Error: SOCKS 4 - Domain not found (NXDOMAIN)
PRQseTORexit3  24A04061669B89317B2CD235B0C556BAA96FD28E  dns_fail  DNS Error: SOCKS 4 - Domain not found (NXDOMAIN)
obzgs5tbmn4q   1F9A218BF276554927EBEFFE0B86826697CBDDD2  dns_fail  DNS Error: SOCKS 4 - Domain not found (NXDOMAIN)
obzgs5tbmn4q   1C148C180F239F8D6BB44967E526EB4FD04D53AF  dns_fail  DNS Error: SOCKS 4 - Domain not found (NXDOMAIN)
PRQseTORexit   439600DC8013EAF8C8ED9CDB35B9D0C65C3805E4  dns_fail  DNS Error: SOCKS 4 - Domain not found (NXDOMAIN)
One hypothesis for the discrepancy is that the new DNS health module resolves unique wildcard domains per relay and per run, bypassing caches, whereas the existing dnsresolution module tests fixed, commonly cached domains. That difference may surface resolver behavior or filtering that does not appear when resolving example.com or torproject.org.
Are the results for those relays stable if you scan over the course of a couple of days or are they fluctuating?
Yes, roughly 22 relays fail repeatedly. Roughly half of them, 11, change over time. We reached out to some of the large-scale exit relay operators we were already in contact with, and they resolved some of the issues on their relays. Attached images at the end: Persistent and Failure Churn.
Some relays have failed in every scan over the last month: obzgs5tbmn4q, PRQseTORexit3, PRQseTORexit, PRQseTORexit2, PRQseTORexit6, PRQseTORexit9 (Attached chart: Relay Failure Frequency Distribution)
These specific relays are clustered among a small number of operators, which may indicate shared resolver or forwarding configuration rather than network-wide issues.
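The persistent-versus-churn split described above can be computed from per-scan sets of failing fingerprints. The sketch below is one way to do it, assuming each scan is available as a set of fingerprints that failed DNS; the function names and the "failed in every scan" definition of persistent are assumptions for the example, not the definitions behind the attached charts.

```python
from collections import Counter


def failure_frequency(scans):
    """Count, per fingerprint, in how many scans the relay failed DNS."""
    counts = Counter()
    for failed in scans:
        counts.update(set(failed))  # count each relay at most once per scan
    return counts


def split_persistent(scans):
    """Return (persistent, churning) fingerprint sets.

    persistent: failed in every scan; churning: failed in some scans only.
    """
    counts = failure_frequency(scans)
    total = len(scans)
    persistent = {fp for fp, n in counts.items() if n == total}
    churning = set(counts) - persistent
    return persistent, churning
```

The per-relay counts from failure_frequency are also the raw input for a failure-frequency distribution like the one mentioned in the message.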
~415 exit relays with circuit or infrastructure failures, where we were unable to reach the relay.
We observed higher variability in circuit reachability than expected. We confirmed the behavior across multiple scanning servers in different geographic regions using high-quality guard relays (fast, stable, uptime, bandwidth minimum, etc.). Full root-cause triage is still pending.
To reduce false positives, we now perform 4 parallel runs every 12 hours and consider a relay healthy if any run has a successful DNS resolution. Combining the 4 parallel runs gives a higher success rate and fewer failures:
- DNS success rate: rose from 97.5% to 98.9%
- Relay validation errors: dropped from ~415 to ~15 relays per run (Unreachable / Timeout chart attached)
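The "healthy if any run succeeds" rule above can be written down in a few lines. This is a sketch, not the deployed logic: dns_fail is the status string shown on the results site, while success, unreachable, and the tie-break between dns_fail and unreachable are assumptions made for the example.

```python
def aggregate_runs(statuses):
    """Combine one relay's statuses across parallel runs.

    A 'success' in any run wins, as described above. Preferring 'dns_fail'
    over 'unreachable' when no run succeeded is an assumption of this sketch.
    """
    if "success" in statuses:
        return "healthy"
    if "dns_fail" in statuses:
        return "dns_fail"
    return "unreachable"


def summarize(runs_by_relay):
    """Return {fingerprint: combined status} for a dict of per-run status lists."""
    return {fp: aggregate_runs(s) for fp, s in runs_by_relay.items()}
```

With this rule a single transient circuit failure no longer marks a relay as failing, which is the false-positive reduction the message describes; the cost is that an intermittent DNS problem can be masked by one lucky run.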
Hrm. exitmap can run modules for particular relays (provided on the command line either per fingerprint or file). So, when an operator is troubleshooting an issue it makes more sense to me to run something just with their fingerprints than to scan the network over and over again just so the operator can see whether something is working again for them. Or maybe I missed the point as to why you need to scan the whole network within a couple of hours' frequency for that use case? I think running the scan for the whole network at most once a day and then zooming closer into relay groups with issues is a strictly better deployment plan.
Several operators indicated they prefer centralized visibility rather than running exitmap locally, citing time constraints and operational overhead.
For timing, agreed, as we're not seeing significant changes every few hours. We've switched to running every 12 hours. Still have more variability than preferred and will keep working on reducing that. More details below.
Operators we spoke with did not express concern about the additional DNS volume (~8 queries per day per relay), noting that this is negligible compared to normal exit traffic levels and these queries provide insights they aren't seeing otherwise.
Appreciate the guidance on filing a ticket. We’ll open one shortly to begin the upstream discussion.
Sent in reply to the message from Georg Koppen via network-health <network-health@lists.torproject.org> of Monday, January 26th, 2026 at 2:36 AM.
participants (3)
- forest-relay-contact@cryptolab.net
- Georg Koppen
- Tor at 1AEO