Hello.
Recently I decided to set up a new relay. After several days of waiting, I have realized that the Bandwidth Authorities' decision, that my bandwidth is 1000 times lower than it should be, is pretty stable.
That is bad on its own, but I was wondering: how many other relays suffer from the same problem? Since all network data is open to analysis, I have decided to calculate some statistics. As "Consensus Weight", theoretically, should correspond to a relay's bandwidth, my first thought was to compare it with the "Advertised Bandwidth" value (assuming there are not too many liars on the network).
The result revealed some anomalies: https://s8.hostingkartinok.com/uploads/images/2017/06/fed1cf8b57fc027223c8ea...
First, and most important: a lot of relays have a bandwidth estimate in the range 0-50: 1082 of them.
Second - there are incorrect estimates for popular bandwidths of 5, 10 and 20 MBits.
The next question was: what estimates were actually assigned to those bandwidth spikes? Maybe all zeroes? This led me to some more charts: https://s8.hostingkartinok.com/uploads/images/2017/06/8cefb70fce667a1b89c783... https://s8.hostingkartinok.com/uploads/images/2017/06/2e42634ea3f9b71df8a7fd... x here is "Advertised Bandwidth", y is "Consensus Weight". I expected to see something close to the x = y line.
But the result was much worse. The first problem (not too important) is a lot of randomness. A 5 MiB relay can easily be detected as 1 MiB or 10 MiB.
The second one is something that probably steals a lot of available network bandwidth: relays with low "Advertised Bandwidth" get much less traffic than they can handle. Almost no relay with speed < 500 KiB is rated correctly. Similarly, high-speed relays have higher weight than needed.
If all 0-50KiB-estimated relays are capable of serving at least 100 KiB, fixing this problem will lead to ~ (100-25)*1082 = 82 MiB/s increase of network bandwidth. But they have even more potential, I think.
Does anyone have ideas on how to solve this problem?
-- Vort
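As a quick check of the estimate in the message above, the arithmetic can be reproduced in a few lines. This is only a sketch: the 25 KiB/s figure is assumed to be the midpoint of the 0-50 range, the function name is made up for illustration, and the result is a rough guess rather than a measurement.

```python
KIB_PER_MIB = 1024

def estimated_gain_kib(relay_count, assumed_capacity_kib, assumed_current_kib):
    """Extra KiB/s if every relay moved from its current estimate to the assumed capacity."""
    return relay_count * (assumed_capacity_kib - assumed_current_kib)

# 1082 relays, assumed capable of 100 KiB/s, assumed to be rated at about 25 KiB/s
gain_kib = estimated_gain_kib(1082, 100, 25)
gain_mib = gain_kib / KIB_PER_MIB
print(gain_kib, round(gain_mib, 1))
```

The product is 81150 KiB/s, which is about 79 MiB/s when divided by 1024, so the ~82 MiB/s figure is slightly high but of the same order.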
Hi,
Thanks for writing to us. This is a question that gets asked a lot:
"Many people set up new fast relays and then wonder why their bandwidth is not fully loaded instantly…"
https://blog.torproject.org/blog/lifecycle-of-a-new-relay
I'll give short answers below, see the link for details.
On 11 Jun 2017, at 15:33, Vort vvort@yandex.ru wrote:
Hello.
Recently I decided to set up a new relay. After several days of waiting, I have realized that the Bandwidth Authorities' decision, that my bandwidth is 1000 times lower than it should be, is pretty stable.
It can take a week or two for the bandwidth authorities to measure a relay.
That is bad on its own, but I was wondering: how many other relays suffer from the same problem? Since all network data is open to analysis, I have decided to calculate some statistics. As "Consensus Weight", theoretically, should correspond to a relay's bandwidth, my first thought was to compare it with the "Advertised Bandwidth" value (assuming there are not too many liars on the network).
The advertised bandwidth is the maximum a relay has seen itself use for 10 seconds in the past day or so.
The consensus weight is the median measured bandwidth over weeks for the relay from 3-5 different bandwidth authorities.
The result revealed some anomalies: https://s8.hostingkartinok.com/uploads/images/2017/06/fed1cf8b57fc027223c8ea... First, and most important: a lot of relays have a bandwidth estimate in the range 0-50: 1082 of them.
I don't know what each axis is on this graph.
20 is the default, 50 is the maximum for a relay's self-test.
If a relay isn't measured, or measures very low, it usually gets a figure in this range.
Second - there are incorrect estimates for popular bandwidths of 5, 10 and 20 MBits.
I don't understand what you mean here. The advertised bandwidth is in kilobytes per second, and the consensus weight is dimensionless (but scaled from kilobytes per second).
Can you point out the lines you mean?
The next question was: what estimates were actually assigned to those bandwidth spikes? Maybe all zeroes? This led me to some more charts: https://s8.hostingkartinok.com/uploads/images/2017/06/8cefb70fce667a1b89c783... https://s8.hostingkartinok.com/uploads/images/2017/06/2e42634ea3f9b71df8a7fd... x here is "Advertised Bandwidth", y is "Consensus Weight". I expected to see something close to the x = y line.
Thanks for doing these graphs! They look as close to the x = y line as I would expect. It doesn't surprise me that there is bias at the lower end. This is less important than bias or variance at the high end.
But the result was much worse. The first problem (not too important) is a lot of randomness. A 5 MiB relay can easily be detected as 1 MiB or 10 MiB.
This is normal: what a relay sees as its own maximum bandwidth is often different from the sustained bandwidth the tor network can get out of it.
The second one is something that probably steals a lot of available network bandwidth: relays with low "Advertised Bandwidth" get much less traffic than they can handle. Almost no relay with speed < 500 KiB is rated correctly. Similarly, high-speed relays have higher weight than needed.
This is good for clients: high speed relays give low latencies.
If all 0-50KiB-estimated relays are capable of serving at least 100 KiB, fixing this problem will lead to ~ (100-25)*1082 = 82 MiB/s increase of network bandwidth. But they have even more potential, I think.
Bandwidth does not add in a simple way: we are trying to minimise the bandwidth-delay product for clients, not maximise the bandwidth used.
Overloaded relays are slow, and under-used fast, nearby relays are a waste. But this is hard to detect.
Does anyone have ideas on how to solve this problem?
I'm not sure if this is a problem. And I'm not sure how many relays it impacts.
But we know there is a bias in Tor's measurements towards North America and Europe, because that's where most of the measurements are made from:
https://trac.torproject.org/projects/tor/wiki/doc/BandwidthAuthorityMeasurem...
We are working on fixing this by measuring from different places. It will also help if we get more bandwidth authorities.
T
-- Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B ricochet:ekmygaiu4rzgsk6n xmpp: teor at torproject dot org ------------------------------------------------------------------------
Thanks for writing to us.
Thanks for the answers.
This is a question that gets asked a lot: "Many people set up new fast relays and then wonder why their bandwidth is not fully loaded instantly…" https://blog.torproject.org/blog/lifecycle-of-a-new-relay
Maybe this is correct in many cases. But definitely not in all of them.
For example, this line: "once the bwauths have measured you and the directory authorities lift the 20KB cap, you'll attract more and more traffic". Events can go the other way: the bwauths will assign a lower weight, and the relay will get less and less traffic.
It can take a week or two for the bandwidth authorities to measure a relay.
A relay that hits this problem can stay in an underpowered state for months.
I'm not sure if this is a problem. And I'm not sure how many relays it impacts.
Hundreds, I guess.
Here are some examples:
https://atlas.torproject.org/#details/9FC2673BB2704C2AAB851F8334938565DF1D08... Now used bandwidth: 1 KiB/s Advertised Bandwidth: 131.38 KiB/s Top used bandwidth: 250 KiB/s Bandwidth rate: 4000 KiB/s
https://atlas.torproject.org/#details/B918EB3FA4D03A4F9F632AA17F217A6C04044E... Now used bandwidth: 1 KiB/s Advertised Bandwidth: 82.65 KiB/s Top used bandwidth: 245 KiB/s Bandwidth rate: 800 KiB/s
https://atlas.torproject.org/#details/DF1C6C645C5854780778A3E81D12F2A8FF6574... Now used bandwidth: 1 KiB/s Advertised Bandwidth: 62.29 KiB/s Top used bandwidth: 7 KiB/s Bandwidth rate: 3000 KiB/s
https://atlas.torproject.org/#details/E2AF5879F39FF40DF8994E9B8FAEAB2518AEEB... Now used bandwidth: 1 KiB/s Advertised Bandwidth: 70.94 KiB/s Top used bandwidth: 916 KiB/s Bandwidth rate: 1000 KiB/s
As you can see, most of them can handle a lot more traffic: 50x-4000x. I also don't see why they would have high latency. Good relays, in my opinion.
But we know there is a bias in Tor's measurements towards North America and Europe, because that's where most of the measurements are made from:
No, this has no impact in this case.
I have launched my own instance of BwAuthority and I see that the measured "filt_bw" values are pretty close to "Advertised Bandwidth":
node_id=$9FC2673BB2704C2AAB851F8334938565DF1D0819 nick=qq strm_bw=52732 filt_bw=77967 circ_fail_rate=0.0 desc_bw=134537 ns_bw=13000
node_id=$9FC2673BB2704C2AAB851F8334938565DF1D0819 nick=qq strm_bw=61278 filt_bw=70430 circ_fail_rate=0.0 desc_bw=85495 ns_bw=13000
node_id=$B918EB3FA4D03A4F9F632AA17F217A6C04044EF7 nick=TranTor strm_bw=40485 filt_bw=47052 circ_fail_rate=0.0 desc_bw=84635 ns_bw=12000
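To make the comparison between "filt_bw" and the descriptor bandwidth concrete, the three records above can be parsed as plain key=value pairs. This is a minimal sketch, not part of the bandwidth authority code: the helper name is hypothetical, and it assumes filt_bw and desc_bw are expressed in the same unit, which the output itself does not state.

```python
SAMPLE = """\
node_id=$9FC2673BB2704C2AAB851F8334938565DF1D0819 nick=qq strm_bw=52732 filt_bw=77967 circ_fail_rate=0.0 desc_bw=134537 ns_bw=13000
node_id=$9FC2673BB2704C2AAB851F8334938565DF1D0819 nick=qq strm_bw=61278 filt_bw=70430 circ_fail_rate=0.0 desc_bw=85495 ns_bw=13000
node_id=$B918EB3FA4D03A4F9F632AA17F217A6C04044EF7 nick=TranTor strm_bw=40485 filt_bw=47052 circ_fail_rate=0.0 desc_bw=84635 ns_bw=12000
"""

def parse_record(line):
    """Split one whitespace-separated key=value record into a dict (hypothetical helper)."""
    fields = dict(token.split("=", 1) for token in line.split())
    for key in ("strm_bw", "filt_bw", "desc_bw", "ns_bw"):
        fields[key] = int(fields[key])
    return fields

records = [parse_record(line) for line in SAMPLE.splitlines() if line.strip()]

for rec in records:
    # Ratio of the filtered measurement to the descriptor bandwidth (assumed same unit).
    ratio = rec["filt_bw"] / rec["desc_bw"]
    print(f'{rec["nick"]}: filt_bw / desc_bw = {ratio:.2f}')
```

A ratio near 1 would mean the scanner saw roughly what the relay advertises; how the ratio is later turned into a consensus weight is not something this sketch covers.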
The problem is on the next step, I think.
The result revealed some anomalies: https://s8.hostingkartinok.com/uploads/images/2017/06/fed1cf8b57fc027223c8ea... First, and most important: a lot of relays have a bandwidth estimate in the range 0-50: 1082 of them.
I don't know what each axis is on this graph.
x is KiB/s, y is count (yellow bars are for "Advertised Bandwidth", blue for "Consensus Weight", grey means both values)
20 is the default, 50 is the maximum for a relay's self-test. If a relay isn't measured, or measures very low, it usually gets a figure in this range.
I have excluded non-measured relays from this histogram.
Second - there are incorrect estimates for popular bandwidths of 5, 10 and 20 MBits.
I don't understand what you mean here. The advertised bandwidth is in kilobytes per second, and the consensus weight is dimensionless (but scaled from kilobytes per second).
Can you point out the lines you mean?
Look at the yellow spike at x = ~1200. The low blue bars at the same point mean that the "Consensus Weight" model did not take into account that there are many 1200 KiB/s nodes on the network, which will result in their underload.
-- Vort
On 13 Jun 2017, at 00:58, Vort vvort@yandex.ru wrote: ...
This is a question that gets asked a lot: "Many people set up new fast relays and then wonder why their bandwidth is not fully loaded instantly…" https://blog.torproject.org/blog/lifecycle-of-a-new-relay
Maybe this is correct in many cases. But definitely not in all of them.
For example, this line: "once the bwauths have measured you and the directory authorities lift the 20KB cap, you'll attract more and more traffic". Events can go the other way: the bwauths will assign a lower weight, and the relay will get less and less traffic.
We know of relays that have improved their bandwidth measurements by changing their keys (this resets the measurements).
But most relays get low weights because:
* they do not get good bandwidth over time,
* they do not get good bandwidth to the rest of the tor network,
* they have high latency to the rest of the tor network,
* they can not get enough CPU or RAM,
* they can not keep enough connections open,
* they go up and down a lot,
* they change IP address a lot, or
* some other reasons that make them less useful to clients.
Here is more information about this: https://lists.torproject.org/pipermail/tor-relays/2016-November/010928.html
...
I'm not sure if this is a problem. And I'm not sure how many relays it impacts.
Hundreds, I guess.
Here are some examples:
https://atlas.torproject.org/#details/9FC2673BB2704C2AAB851F8334938565DF1D08... Now used bandwidth: 1 KiB/s Advertised Bandwidth: 131.38 KiB/s Top used bandwidth: 250 KiB/s Bandwidth rate: 4000 KiB/s
Here are the measurements from each bandwidth authority (large page): https://consensus-health.torproject.org/consensus-health-2017-06-12-22-00.ht...
And look at other relays in the same AS: https://atlas.torproject.org/#search/as:AS8100
But this isn't enough information to work out what the problem is. Maybe there is a problem with the relay, not the measurements. We just can't tell.
... As you can see, most of them can handle a lot more traffic: 50x-4000x. I also don't see why they would have high latency. Good relays, in my opinion.
Maybe the relay has low CPU, international bandwidth, or connection limits. We just don't know.
We would need to talk to the operator to find out.
... I have launched my own instance of BwAuthority and I see that the measured "filt_bw" values are pretty close to "Advertised Bandwidth": ...
These measurements are updated over time. Please check again after a few weeks.
The problem is on the next step, I think.
The result revealed some anomalies: https://s8.hostingkartinok.com/uploads/images/2017/06/fed1cf8b57fc027223c8ea... First, and most important: a lot of relays have a bandwidth estimate in the range 0-50: 1082 of them.
I don't know what each axis is on this graph.
x is KiB/s, y is count (yellow bars are for "Advertised Bandwidth", blue for "Consensus Weight", grey means both values) ...
Second - there are incorrect estimates for popular bandwidths of 5, 10 and 20 MBits.
I don't understand what you mean here. The advertised bandwidth is in kilobytes per second, and the consensus weight is dimensionless (but scaled from kilobytes per second).
Can you point out the lines you mean?
Look at the yellow spike at x = ~1200. The low blue bars at the same point mean that the "Consensus Weight" model did not take into account that there are many 1200 KiB/s nodes on the network, which will result in their underload.
The consensus weight model does not fit relays to a curve. It can have lots of relays at the same bandwidth.
I think this spike means:
"You think your provider is giving you 100 Mbps, but they are actually giving you much less. Talk to them about it."
Usually this is because the provider only tries to give everyone 100Mbps, or they limit everyone and don't tell them, or they don't pay enough to get good international bandwidth.
T
-- Tim Wilson-Brown (teor)
We know of relays that have improved their bandwidth measurements by changing their keys (this resets the measurements).
1. It is not possible to change keys for relays which you don't control.
2. It is better to have algorithms that can't get stuck.
But most relays get low weights because:
- they can not get enough CPU or RAM,
- they can not keep enough connections open,
This is not the case for relays with 1 KiB/s load.
- they go up and down a lot,
My examples were for stable relays.
- they change IP address a lot, or
ExoneraTor is lagging, but 3 of the 4 example relays were using the same addresses a month ago.
- they do not get good bandwidth over time,
- they do not get good bandwidth to the rest of the tor network,
- they have high latency to the rest of the tor network,
This can be measured. For example, BD4354E76929C90B7004FF149A3C52189A3B4634 is capable of serving 1 MiB/s (I made a circuit through it this morning):
r Hedgehog vUNU52kpyQtwBP8UmjxSGJo7RjQ BG894JEWmT0pcLmWTabGYlWT5Iw 2017-06-13 06:08:30 212.26.140.81 443 0 ... w Bandwidth=1024 Measured=5
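The gap between the two numbers on the "w" line above can be read off mechanically. The following is a small sketch, assuming the line consists of the keyword "w" followed by integer key=value pairs as shown; the function name is hypothetical and the units of both values are taken to be the same.

```python
def parse_w_line(line):
    """Parse a 'w Key=Value ...' line into a dict of integers (illustrative helper)."""
    parts = line.split()
    if not parts or parts[0] != "w":
        raise ValueError("not a 'w' line")
    return {key: int(value) for key, value in (p.split("=", 1) for p in parts[1:])}

weights = parse_w_line("w Bandwidth=1024 Measured=5")
# How many times larger the self-reported value is than the measured one.
factor = weights["Bandwidth"] / weights["Measured"]
print(weights, factor)
```

For this relay the self-reported value is about 205 times the measured one, which is the discrepancy being discussed.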
- some other reasons that make them less useful to clients.
Looks like clients have no influence on BwAuth's decisions.
But this isn't enough information to work out what the problem is. Maybe there is a problem with the relay, not the measurements. We just can't tell.
What additional information can help?
Maybe the relay has low CPU, international bandwidth, or connection limits. We just don't know.
If it can relay a lot of traffic, then it has no such problems.
We would need to talk to the operator to find out.
I would not have raised this question if I weren't such an operator.
These measurements are updated over time. Please check again after a few weeks.
They already show that the relay is more capable than it is rated.
I think this spike means:
"You think your provider is giving you 100 Mbps, but they are actually giving you much less. Talk to them about it."
Usually this is because the provider only tries to give everyone 100Mbps, or they limit everyone and don't tell them, or they don't pay enough to get good international bandwidth.
The exact number does not matter. The problem is that the weight histogram has no equivalent spike.
Here is another histogram. https://s8.hostingkartinok.com/uploads/images/2017/06/749e7e3be806c22f3dd5c0... (x, y and colors are the same) I just filtered the relays so that their Advertised Bandwidth is in the range 1100000..1350000. I wouldn't say these values are "proportional" enough.
-- Vort
On 14 Jun 2017, at 02:01, Vort vvort@yandex.ru wrote: ...
But most relays get low weights because:
- they can not get enough CPU or RAM,
- they can not keep enough connections open,
This is not the case for relays with 1 KiB/s load.
- they go up and down a lot,
My examples were for stable relays.
- they change IP address a lot, or
ExoneraTor is lagging, but 3 of the 4 example relays were using the same addresses a month ago.
Please help us find out which of these things impact one of your relays.
- they do not get good bandwidth over time,
- they do not get good bandwidth to the rest of the tor network,
- they have high latency to the rest of the tor network,
This can be measured.
Yes, the Tor network measures it from 4 different locations every few days. What makes your measurement more accurate?
Where are you measuring from? Is it close to the relay? How long did it take to do the download? Did you measure from different parts of the world?
...
But this isn't enough information to work out what the problem is. Maybe there is a problem with the relay, not the measurements. We just can't tell.
What additional information can help?
1. Choose a relay you control to focus on.
2. Send information about the relay's CPU and RAM and configured connection limit.
3. Measure the actual connection limit, bandwidth and latency from the rest of the Tor network. (Or from at least 2 locations in the US and Western Europe.)
Or:
Change the relay's keys, wait a few weeks, and let us know if the bandwidth measurement is better or worse.
If it is better, then the relay was put in a low bucket, and was stuck in that bucket. This can happen at random, or if the relay was slow in the past.
Maybe the relay has low CPU, international bandwidth, or connection limits. We just don't know.
If it can relay a lot of traffic, then it has no such problems.
I think we will have to agree to disagree about this.
...
These measurements are updated over time. Please check again after a few weeks.
They already show that the relay is more capable than it is rated.
I think we will have to agree to disagree about this.
I think this spike means:
"You think your provider is giving you 100 Mbps, but they are actually giving you much less. Talk to them about it." ...
Here is another histogram. https://s8.hostingkartinok.com/uploads/images/2017/06/749e7e3be806c22f3dd5c0... (x, y and colors are the same) I just filtered the relays so that their Advertised Bandwidth is in the range 1100000..1350000. I wouldn't say these values are "proportional" enough.
I think we will have to agree to disagree about this.
T
-- Tim Wilson-Brown (teor)
I think we will have to agree to disagree about this.
Ok, let's focus on the problem first. Conclusions can be made later.
Please help us find out which of these things impact one of your relays.
The only thing from your list that can have an effect is the properties of other relays and their Internet connections. But a weight is assigned to a single relay, so these differences must be filtered out somehow.
Yes, the Tor network measures it from 4 different locations every few days.
And gives an incorrect weight as a result. A stable incorrect result. What a good weight is needs to be discussed, but that can be done later. My idea is that 1 KiB of load when 1000 KiB of bandwidth is available is bad. Just imagine that this 1 MiB/s really can be used. Of course, not by the whole network. But if some dial-up node can't use it, this doesn't mean that no one can.
What makes your measurement more accurate?
1. I was checking this speed with random relays, which means that high-speed relays were certainly included in the circuits. (this gives 1000000 instead of 58436)
2. I used the obtained value directly to decide whether the relay is good enough. (this gives 1000000 instead of 5000)
Where are you measuring from?
Ukraine.
Is it close to the relay?
I have made many measurements. Some of them were close to the relay, some were not.
How long did it take to do the download?
Here are some results:
extendcircuit 0 $BD4354E76929C90B7004FF149A3C52189A3B4634,$A53C46F5B157DD83366D45A8E99A244934A14C46
650 CIRC 10 BUILT $BD4354E76929C90B7004FF149A3C52189A3B4634~Hedgehog,$A53C46F5B157DD83366D45A8E99A244934A14C46~csailmitexit PURPOSE=GENERAL TIME_CREATED=2017-06-14T06:02:21.529447
650 STREAM 13 CLOSED 10 38.229.72.16:443 REASON=DONE
$ curl --socks5-hostname localhost:9050 --insecure -O https://38.229.72.16/bwauth.torproject.org/16M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.0M  100 16.0M    0     0   856k      0  0:00:19  0:00:19 --:--:--  985k
But this relay is close to my location. Let's select another:
extendcircuit 0 $38BF40B902ABC23B4E1503BE9131F1A3BF8EBAC5,$A53C46F5B157DD83366D45A8E99A244934A14C46
650 CIRC 11 BUILT $38BF40B902ABC23B4E1503BE9131F1A3BF8EBAC5~bzerorelay1,$A53C46F5B157DD83366D45A8E99A244934A14C46~csailmitexit PURPOSE=GENERAL TIME_CREATED=2017-06-14T06:06:19.551317
650 STREAM 17 CLOSED 11 38.229.72.16:443 REASON=DONE
$ curl --socks5-hostname localhost:9050 --insecure -O https://38.229.72.16/bwauth.torproject.org/16M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.0M  100 16.0M    0     0   268k      0  0:01:01  0:01:01 --:--:--  271k
Of course, it is slower. But it is still far above the 2 KiB/s that is actually utilized.
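The average speeds curl reports for the two downloads above can be sanity-checked from the file size and the elapsed time. This is a rough sketch: the function name is made up, the 16M file is assumed to be exactly 16 MiB, and curl prints whole seconds, so the results differ slightly from its own 856k and 268k averages.

```python
def average_kib_per_s(size_mib, seconds):
    """Average throughput in KiB/s for a download of size_mib MiB taking `seconds` seconds."""
    return size_mib * 1024 / seconds

# Elapsed times taken from the two curl runs above (19 s and 61 s).
via_hedgehog = average_kib_per_s(16, 19)
via_bzerorelay1 = average_kib_per_s(16, 61)
print(round(via_hedgehog), round(via_bzerorelay1))
```

Both figures are two orders of magnitude above the roughly 2 KiB/s of traffic these relays were carrying at the time.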
Did you measure from different parts of the world?
I don't have control over multiple locations.
- Choose a relay you control to focus on.
- Send information about the relay's CPU and RAM and configured connection limit.
* CPU: Intel Core i5-4690
* RAM: Team Group Dark-1600, 8 GiB
* RelayBandwidthRate 1 MBytes
* RelayBandwidthBurst 3 MBytes
- Measure the actual connection limit, bandwidth and latency from the rest of the Tor network. (Or from at least 2 locations in the US and Western Europe.)
I doubt that it is a good idea to test with loop connections, so I will post speedtest.net results. They give a good estimate of my connection's properties.
http://www.speedtest.net/my-result/6375679554 DOWNLOAD 94.86Mb/s UPLOAD 95.21Mb/s PING 24 ms SERVER SAINT PETERSBURG
http://www.speedtest.net/my-result/6375676443 DOWNLOAD 89.53Mb/s UPLOAD 94.51Mb/s PING 48 ms SERVER MILAN
http://www.speedtest.net/my-result/6375670157 DOWNLOAD 89.53Mb/s UPLOAD 94.54Mb/s PING 51 ms SERVER DRESDEN
http://www.speedtest.net/my-result/6375655269 DOWNLOAD 85.01Mb/s UPLOAD 24.01Mb/s PING 131 ms SERVER NEW YORK CITY, NY
http://www.speedtest.net/my-result/6375666714 DOWNLOAD 53.70Mb/s UPLOAD 16.19Mb/s PING 205 ms SERVER SAN FRANCISCO, CA
http://www.speedtest.net/my-result/6375685977 DOWNLOAD 24.69Mb/s UPLOAD 13.46Mb/s PING 291 ms SERVER TOKYO
Maybe Atlas graphs also can help: https://s8.hostingkartinok.com/uploads/images/2017/06/caaccae1a967a838871cde... https://s8.hostingkartinok.com/uploads/images/2017/06/60ff981f51002bd177e8c9...
Or:
Change the relay's keys, wait a few weeks, and let us know if the bandwidth measurement is better or worse.
I don't want to lose the state that reproduces the bug.
If it is better, then the relay was put in a low bucket, and was stuck in that bucket. This can happen at random, or if the relay was slow in the past.
My relay was never slow. The possibility of getting stuck at random like this is something that needs to be eliminated.
-- Vort
Hello, teor.
Is it worth waiting until you have time to investigate the stuck relays problem?
-- Vort
On 27 Jun 2017, at 14:57, Vort vvort@yandex.ru wrote:
Hello, teor.
Is it worth waiting until you have time to investigate the stuck relays problem?
You can help yourself!
Before we investigate the measurements:
* We need to know if anything on your relay or at your provider is making your relay slow,
* We need to know which measurement of your relay is slow.
I made a wiki page to tell people how to do that: https://trac.torproject.org/projects/tor/wiki/doc/MyRelayIsSlow
Please go through the steps, and let us know what results you get. Then someone can help you more.
It might not be me that helps you. So please talk to the list when you write back.
T
-- Tim Wilson-Brown (teor)
Before we investigate the measurements:
- We need to know if anything on your relay or at your provider is making your relay slow,
- We need to know which measurement of your relay is slow.
I made a wiki page to tell people how to do that: https://trac.torproject.org/projects/tor/wiki/doc/MyRelayIsSlow
Please go through the steps, and let us know what results you get. Then someone can help you more.
- Check RAM, CPU, and socket/file descriptor usage on your relay
The private bytes amount for the tor.exe process is 116 MiB, and 3.4 GiB of system memory is available (out of 8 GiB total). The log shows: "Based on detected system memory, MaxMemInQueues is set to 2048 MB. You can override this by setting MaxMemInQueues by hand."
Tor is using 0-1% of CPU on average. Sometimes it consumes 25% of CPU (100% of 1 core) for 10 seconds, and then returns to the normal 0-1% usage.
The Tor process has 573 handles open and about 380 established TCP connections. But this is unusual activity, related to the Faravahar downtime and, consequently, to obtaining the Fast and HSDir flags. Usually, it has only 20-30 established connections.
- Check the internet peering (bandwidth, latency) from your relay's provider to other relays.
Latency (ping):
hviv104 / 192.42.116.16 : 40 ms
PrivacyRepublic0001 / 178.32.181.96 : 39 ms
Unnamed / 185.170.41.8 : 36 ms
McCormickRecipes / 18.85.22.204 : 135 ms
PhantomTrain4 / 65.19.167.131 : 184 ms
Bandwidth (via dopper / 192.42.113.102 and bwauth's 16M file):
hviv104 : 50 KiB/s
PrivacyRepublic0001 : 1.3 MiB/s
Unnamed : 155 KiB/s
McCormickRecipes : 947 KiB/s
PhantomTrain4 : 899 KiB/s
- Check each of the votes for your relay on consensus-health (large page), and check the median:
Consensus was published 2017-06-27 12:00:00.
longclaw: bw=34
gabelmoo: bw=41
moria1 : bw=23
*median*: bw=34
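The median of the three votes above can be reproduced directly. This is only an illustration using Python's standard library; how the directory authorities handle an even number of votes is not covered in this thread, and statistics.median averages the two middle values in that case, which may not match their behaviour.

```python
from statistics import median

# Bandwidth votes copied from the consensus-health listing above.
votes = {"longclaw": 34, "gabelmoo": 41, "moria1": 23}

consensus_bw = median(votes.values())
print(consensus_bw)
```

With an odd number of votes the result is simply the middle value, here the one from longclaw.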
- Check your relay's observed bandwidth and bandwidth rate (limit).
Bandwidth rate : 1 MiB/s
Bandwidth burst : 3 MiB/s
Observed bandwidth: 250.77 KiB/s
- Run a test using tor to see how fast tor can get on your network/CPU:
This will alter the observed bandwidth. But okay. Depending on the exit node, the result varies from 117 KiB/s to 1 MiB/s. Example:
$ curl --socks5-hostname localhost:9050 --insecure -O https://38.229.72.16/bwauth.torproject.org/16M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.0M  100 16.0M    0     0   857k      0  0:00:19  0:00:19 --:--:--  967k
- Run a test using tor and chutney to find out how fast tor can get on your CPU. Keep increasing the data volume until the bandwidth stops increasing:
As this tool is designed for Linux, I can run it only within a virtual machine. But the results are still good:
$ CHUTNEY_DATA_BYTES=104857600 ./chutney verify networks/basic-min
...
Single Stream Bandwidth: 99.46 MBytes/s
Overall tor Bandwidth: 397.85 MBytes/s
$ CHUTNEY_DATA_BYTES=1048576000 ./chutney verify networks/basic-min
...
Single Stream Bandwidth: 43.52 MBytes/s
Overall tor Bandwidth: 174.07 MBytes/s
It might not be me that helps you. So please talk to the list when you write back.
But no one else has shown interest in answering this topic.
-- Vort
On 27 Jun 2017, at 23:52, Vort vvort@yandex.ru wrote: ...
I made a wiki page to tell people how to do that: https://trac.torproject.org/projects/tor/wiki/doc/MyRelayIsSlow ...
- Check RAM, CPU, and socket/file descriptor usage on your relay
The private bytes amount for the tor.exe process is 116 MiB,
This could be part of your issue. The code for tor relays on Windows is not maintained very well.
...
The Tor process has 573 handles open and about 380 established TCP connections. But this is unusual activity, related to the Faravahar downtime and, consequently, to obtaining the Fast and HSDir flags. Usually, it has only 20-30 established connections.
What is the connection / handle limit on the tor process and the user you are using for the tor process?
For a non-exit relay, it needs to be around 10,000. For a large exit relay, it needs to be 50,000 or so.
...
- Check each of the votes for your relay on consensus-health (large page), and check the median:
Consensus was published 2017-06-27 12:00:00.
longclaw: bw=34
gabelmoo: bw=41
moria1 : bw=23
*median*: bw=34
This is the limit on your relay.
Now check the latency and bandwidth to these directory authorities. But only do it once, they have a lot of load already.
Also, use gabelmoobwauth, rather than gabelmoo. And check Faravahar.
- Check your relay's observed bandwidth and bandwidth rate (limit).
Bandwidth rate : 1 MiB/s
Bandwidth burst : 3 MiB/s
Observed bandwidth: 250.77 KiB/s
Ok, the next limit will be the observed bandwidth.
...
It might not be me that helps you. So please talk to the list when you write back.
But no one else has shown interest in answering this topic.
If you just write to me, that won't change. You need to be patient.
T
-- Tim Wilson-Brown (teor)
This could be part of your issue. The code for tor relays on Windows is not maintained very well.
There are many relays on Windows that are not stuck, and many relays on Linux that are stuck.
What is the connection / handle limit on the tor process and the user you are using for the tor process?
For a non-exit relay, it needs to be around 10,000. For a large exit relay, it needs to be 50,000 or so.
Windows does not limit the connection count for processes or users. There is also no system-wide limit for sockets, except for the available dynamic port range (1025-64510 on my computer).
Now check the latency and bandwidth to these directory authorities. But only do it once, they have a lot of load already.
Also, use gabelmoobwauth, rather than gabelmoo. And check Faravahar.
Latency (ping):
longclaw / 199.254.238.53 : 187 ms
gabelmoobwscan / 131.188.40.189 : 44 ms
moria1 / 128.31.0.34 : 128 ms
faravahar / 154.35.175.225 : 147 ms
Bandwidth (via PrivacyRepublic0001 and 16M file from 38.229.72.16):
longclaw : 285 KiB/s
gabelmoobwscan : 1195 KiB/s
moria1 : 404 KiB/s
faravahar : 141 KiB/s
Ok, the next limit will be the observed bandwidth.
After yesterday's test #5, the observed bandwidth changed to 1.12 MiB/s.
You need to be patient.
That's not a problem if I know that something will definitely change in the future.
-- Vort
On 28 Jun 2017, at 15:39, Vort vvort@yandex.ru wrote: ...
What is the connection / handle limit on the tor process and the user you are using for the tor process?
For a non-exit relay, it needs to be around 10,000. For a large exit relay, it needs to be 50,000 or so.
Windows does not limit the connection count for processes or users. There is also no system-wide limit for sockets, except for the available dynamic port range (1025-64510 on my computer).
Depending on your Windows version, the limit may be around 2000-4000, check this article: http://smallvoid.com/article/winnt-tcpip-max-limit.html
You should also check how many connections your relay is actually making.
Now check the latency and bandwidth to these directory authorities. But only do it once, they have a lot of load already.
Also, use gabelmoobwauth, rather than gabelmoo. And check Faravahar.
Latency (ping):
longclaw / 199.254.238.53 : 187 ms
gabelmoobwscan / 131.188.40.189 : 44 ms
moria1 / 128.31.0.34 : 128 ms
faravahar / 154.35.175.225 : 147 ms
Bandwidth (via PrivacyRepublic0001 and 16M file from 38.229.72.16):
longclaw : 285 KiB/s
gabelmoobwscan : 1195 KiB/s
moria1 : 404 KiB/s
faravahar : 141 KiB/s
Ok, so if your relay is in the 16MB bucket, it should be measured at at least 200 after a few weeks. But it's hard to tell which bucket each relay is in, that depends on the bandwidth authority.
Ok, the next limit will be the observed bandwidth.
After yesterday's test #5, the observed bandwidth changed to 1.12 MiB/s.
That might unstick your relay. We need to know if this happens, because it helps us to know what to do to fix stuck relays.
You need to be patient.
That's not a problem if I know that something will definitely change in the future.
We are working on it a few different ways:
* increasing the minimum bandwidth authority file size
* making an automatic process to un-stick stuck relays
* getting more bandwidth authorities in more places
* re-writing the bandwidth authority code
T
-- Tim Wilson-Brown (teor)
Windows does not limit the connection count for processes and users. There is also no system-wide limit for sockets, except for the available dynamic port range (1025-64510 on my computer).
Depending on your Windows version, the limit may be around 2000-4000, check this article: http://smallvoid.com/article/winnt-tcpip-max-limit.html
1. This article is from 2004. Most of the parameters it describes have been removed from modern systems.
2. Available port range on my computer is 64510 - 1025 = 63485 ports. But this limits only outbound connection count. Inbound connection count is unlimited. https://msdn.microsoft.com/en-us/library/windows/desktop/ms739169(v=vs.85).a... : "The Microsoft Winsock provider limits the maximum number of sockets supported only by available memory on the local computer."
3. The half-open connection limit mentioned in the article was removed starting with Windows 7.
You should also check how many connections your relay is actually making.
Jun 28 01:30:51.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 384 v4 connections; and received 26 v1 connections, 0 v2 connections, 0 v3 connections, and 639 v4 connections.
Jun 28 07:30:51.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 664 v4 connections; and received 46 v1 connections, 0 v2 connections, 0 v3 connections, and 1200 v4 connections.
Jun 28 13:30:51.000 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 0 v3 connections, and 725 v4 connections; and received 63 v1 connections, 1 v2 connections, 0 v3 connections, and 1469 v4 connections.
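To make the totals in those three notice lines easier to compare, here is a minimal Python sketch that sums the per-version counts. It assumes the log format is exactly as quoted above; the function name and the regular expression are illustrative, not part of Tor. Note that these are cumulative counts since startup, not the number of connections open at one moment.

```python
import re

# Matches fragments such as "384 v4 connections" in the quoted notice lines.
COUNT_RE = re.compile(r"(\d+) v(\d) connections")

def parse_connection_counts(line):
    """Return (initiated_total, received_total) for one notice line."""
    # Everything before "; and received" describes initiated connections.
    initiated_part, _, received_part = line.partition("; and received")
    initiated = sum(int(n) for n, _ in COUNT_RE.findall(initiated_part))
    received = sum(int(n) for n, _ in COUNT_RE.findall(received_part))
    return initiated, received

LOG_LINES = [
    "Jun 28 01:30:51.000 [notice] Since startup, we have initiated 0 v1 connections, "
    "0 v2 connections, 0 v3 connections, and 384 v4 connections; and received "
    "26 v1 connections, 0 v2 connections, 0 v3 connections, and 639 v4 connections.",
    "Jun 28 07:30:51.000 [notice] Since startup, we have initiated 0 v1 connections, "
    "0 v2 connections, 0 v3 connections, and 664 v4 connections; and received "
    "46 v1 connections, 0 v2 connections, 0 v3 connections, and 1200 v4 connections.",
    "Jun 28 13:30:51.000 [notice] Since startup, we have initiated 0 v1 connections, "
    "0 v2 connections, 0 v3 connections, and 725 v4 connections; and received "
    "63 v1 connections, 1 v2 connections, 0 v3 connections, and 1469 v4 connections.",
]

for line in LOG_LINES:
    initiated, received = parse_connection_counts(line)
    print(line[:15], "initiated:", initiated, "received:", received)
```

For the lines quoted here this gives 384/665, 664/1246 and 725/1533 (initiated/received), which is far below the outbound port range discussed above.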
Ok, so if your relay is in the 16MB bucket, it should be measured at at least 200 after a few weeks. But it's hard to tell which bucket each relay is in, that depends on the bandwidth authority.
If my relay recovers, that would be great. But the more important goal is to prevent relays from getting stuck like this in the first place. Instead of + 1 MiB/s, that could yield + 1 GiB/s.
That might unstick your relay. We need to know if this happens, because it helps us to know what to do to fix stuck relays.
Yes, it can. Some relays are already unstuck (9FC2673BB2704C2AAB851F8334938565DF1D0819, 143BC876D403003FBEF2AA843942DC4D248E3872 for example). But some are stuck even deeper (B918EB3FA4D03A4F9F632AA17F217A6C04044EF7, BD4354E76929C90B7004FF149A3C52189A3B4634). So my fear is that some relays get unstuck at the expense of others getting stuck. If this is not the case, that is great!
We are working on it a few different ways:
- increasing the minimum bandwidth authority file size
- making an automatic process to un-stick stuck relays
- getting more bandwidth authorities in more places
- re-writing the bandwidth authority code
I saw some changes and was wondering if they are random or not. Thanks for your work.
-- Vort
In order to estimate the effect of the measures taken to unstick relays, I have made some graphs.
The first graph shows how many relays in the network have weight < 20 (as a percentage of the total count of measured, valid and running relays):
https://s8.hostingkartinok.com/uploads/images/2017/06/e905280414853a800031e7...
I see a drop on June 15: from ~7.4% to ~6.6%. This corresponds to the increase of my relay's weight from ~10 to ~20. It looks like this is the effect of increasing the minimum test file size.
Second graph is the same, but for weight < 100:
https://s8.hostingkartinok.com/uploads/images/2017/06/7768344880b81c80442aaa...
No effect can be seen there at this scale.
(Don't know if this analysis can help, but, anyway, here it is)
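For reference, the quantity plotted in both graphs can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the sample weights are invented numbers, not consensus data.

```python
def percent_below(weights, threshold):
    """Percentage of relays whose consensus weight is below the threshold."""
    if not weights:
        return 0.0
    below = sum(1 for w in weights if w < threshold)
    return 100.0 * below / len(weights)

# Invented weights for measured, valid and running relays (illustration only).
sample_weights = [5, 12, 19, 20, 150, 3000, 45, 8, 99, 100]

print(percent_below(sample_weights, 20))   # share with weight < 20
print(percent_below(sample_weights, 100))  # share with weight < 100
```

Running this once per consensus and plotting the result over time would give curves of the kind shown in the two graphs.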
-- Vort
On 29 Jun 2017, at 16:03, Vort vvort@yandex.ru wrote:
In order to estimate the effect of the measures taken to unstick relays, I have made some graphs.
The first graph shows how many relays in the network have weight < 20 (as a percentage of the total count of measured, valid and running relays):
https://s8.hostingkartinok.com/uploads/images/2017/06/e905280414853a800031e7...
I see a drop on June 15: from ~7.4% to ~6.6%. This corresponds to the increase of my relay's weight from ~10 to ~20. It looks like this is the effect of increasing the minimum test file size.
No, we have not changed the minimum test file size yet. This is probably just random measurement variation.
The only thing we have started testing over the last few days is:
- making an automatic process to un-stick stuck relays
We tried to unstick some of the lowest bandwidth relays (below 1000), and our initial results are:
* most (15) of relays we tried are actually very slow, or down,
* some (3) relays that we tried went down before we could see if we had changed anything,
* 1 relay that we tried increased its bandwidth 10x, but we don't know if it was because of us, or something else.
We are trying on a larger set now. If we get better results, maybe we will make it automatic. But these results indicate that most relays that are measured slow are actually slow for tor clients. (Which is what matters.)
T
-- Tim Wilson-Brown (teor)
We tried to unstick some of the lowest bandwidth relays (below 1000),
You mean 0..1 weight (1 KiB/s and lower)? They may be really bad.
I guess that a good test range is 10..20 (or 5..30).
and our initial results are:
- most (15) of relays we tried are actually very slow, or down,
- some (3) relays that we tried went down before we could see if we had changed anything,
Relays can additionally be filtered by the Stable flag (or uptime). If a relay hasn't rebooted for weeks, there is a good chance that it will stay online during the tests too.
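A minimal Python sketch of such a filter, under stated assumptions: the record layout (`fingerprint`, `weight`, `flags`) and the function name are hypothetical, and the sample relays are placeholders rather than real ones.

```python
def pick_test_candidates(relays, min_weight=10, max_weight=20):
    """Return fingerprints of running, stable relays in the given weight range."""
    return [
        r["fingerprint"]
        for r in relays
        if "Running" in r["flags"]
        and "Stable" in r["flags"]
        and min_weight <= r["weight"] <= max_weight
    ]

# Placeholder records (hypothetical layout and values, illustration only).
SAMPLE_RELAYS = [
    {"fingerprint": "RELAY_A", "weight": 15, "flags": ["Running", "Stable", "Valid"]},
    {"fingerprint": "RELAY_B", "weight": 12, "flags": ["Running", "Valid"]},
    {"fingerprint": "RELAY_C", "weight": 4000, "flags": ["Running", "Stable", "Valid"]},
    {"fingerprint": "RELAY_D", "weight": 28, "flags": ["Running", "Stable", "Valid"]},
]

print(pick_test_candidates(SAMPLE_RELAYS))         # weight range 10..20
print(pick_test_candidates(SAMPLE_RELAYS, 5, 30))  # wider range 5..30
```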
We are trying on a larger set now.
I hope this will give more successful attempts.
But these results indicate that most relays that are measured slow are actually slow for tor clients. (Which is what matters.)
If the larger set does not give better results, I will try to write my own test program and run it from my location. Maybe a different approach will give different results. I am still sure that low weight estimates are hiding many fast relays.
-- Vort
Finally, I have written my test program and found that I was wrong about two things:
1. Low weight relays (< 30) rarely give fast speeds (> 150 KiB/s) on two-hop circuits. With three hops, fast speeds are even rarer.
2. The Windows version of Tor really does have some problems.
I don't quite understand the factors that influence circuit speed, but at least I have found how my relay can end up with a low bandwidth estimate.
I have selected two entry nodes and two exit nodes:
refEntry1 D665C959571041972EA8C0DD77559EF5579BA112
refEntry2 13B2354C74CCE29815B4E1F692F2F0E86C7F13DD
refExit1 5CECC5C30ACC4B3DE462792323967087CC53D947
refExit2 07C05ED4825F51D5BE4CDBBAA80BFA484132A2F5
Then launched four circuits:
refEntry1, myNode, refExit1
refEntry1, myNode, refExit2
refEntry2, myNode, refExit1
refEntry2, myNode, refExit2
And measured their bandwidth:
1. 117 KiB/s
2. 122 KiB/s
3. 59 KiB/s
4. 51 KiB/s
That was pretty strange. Previous tests with speedtest.net showed that 500-1000 KiB/s speeds are not a problem for my connection.
The next idea was to measure a neighbor relay (from the same city and ISP). This resulted in the following speeds:
1. 356 KiB/s
2. 392 KiB/s
3. 375 KiB/s
4. 271 KiB/s
Not too fast, but definitely better than the result for my relay.
So one of the limiting factors is located somewhere on my computer. The connection count is fine, and RAM and CPU are also good enough.
The only difference left is the operating system.
It is possible for me to boot Linux from a USB stick. But first I decided to run a test with a virtual machine.
Port forwarding was set up, Ubuntu and a new Tor relay were launched, and here is the result:
1. 452 KiB/s
2. 375 KiB/s
3. 141 KiB/s
4. 163 KiB/s
Usually adding a virtual machine leads to worse results. But not this time.
So the next question is: how can the Linux version running inside Windows perform three times better than the Windows version alone?
And what is the real limit for my configuration? Even if the 100 KiB/s speed changes to 300 KiB/s, this will still be far from the 1 MiB/s, 5 MiB/s or 10 MiB/s that are possible with my connection.
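The "three times" figure can be checked against the measurements above with a short Python sketch. The numbers are copied from the three result lists; the dictionary keys and the function name are just labels chosen for this illustration, and a plain mean over four circuits is a simplification.

```python
from statistics import mean

# Per-circuit speeds in KiB/s, in the order the circuits were listed.
RESULTS_KIB_S = {
    "windows_relay": [117, 122, 59, 51],
    "neighbor_relay": [356, 392, 375, 271],
    "linux_vm_relay": [452, 375, 141, 163],
}

def average_speeds(results):
    """Mean speed per configuration, in KiB/s."""
    return {name: mean(speeds) for name, speeds in results.items()}

averages = average_speeds(RESULTS_KIB_S)
ratio = averages["linux_vm_relay"] / averages["windows_relay"]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} KiB/s")
print(f"Linux VM vs. Windows: {ratio:.2f}x")
```

The averages come out at 87.25, 348.5 and 282.75 KiB/s, so the relay in the Linux VM is roughly 3.2 times faster than the Windows relay on the same circuits.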
-- Vort
tor-relays@lists.torproject.org