Have a new exit running in an excellent network on a very fast server with AES-NI. Server plan is limited to 100TB so have set a limit slightly above this (18000000 bytes/sec) thinking that bandwidth would run 80-90% of the maximum and average to just below the plan limit. After three days the assigned bandwidth for the relay is going up instead of moderating--looks like the measurement system has a problem in this case. Just dropped the limit to 16500000 to stay below 100TB with maximum load.
A good number appears to be around 65000 to 70000, but 98000 was just assigned.
What can be done to extract proper rating from measurement system? Tried setting TokenBucketRefillInterval to 10 milliseconds for more exact control but this has not helped. Should an IPTABLES packet-dropping limit be established? Can the rating system be fixed?
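(A rough sanity check of those figures, assuming a 30-day month and counting both directions: 18,000,000 bytes/sec x 86,400 sec/day x 30 days is about 46.7 TB per direction, or roughly 93 TB total at full rate before TCP/IP overhead; at 80-90% average utilization that drops to around 75-84 TB plus overhead, which appears to be the reasoning behind the 18000000 figure.)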
On 1 Oct 2015, at 14:33, Dhalgren Tor dhalgren.tor@gmail.com wrote:
Have a new exit running in an excellent network on a very fast server with AES-NI. Server plan is limited to 100TB so have set a limit slightly above this (18000000 bytes/sec) thinking that bandwidth would run 80-90% of the maximum and average to just below the plan limit.
How did you set this limit? What did you write in your torrc file?
After three days the assigned bandwidth for the relay is going up instead of moderating--looks like the measurement system has a problem in this case. Just dropped the limit to 16500000 to stay below 100TB with maximum load.
A good number appears to be around 65000 to 70000, but 98000 was just assigned.
What can be done to extract proper rating from measurement system? Tried setting TokenBucketRefillInterval to 10 milliseconds for more exact control but this has not helped. Should an IPTABLES packet-dropping limit be established? Can the rating system be fixed?
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B
teor at blah dot im OTR CAD08081 9755866D 89E2A06F E3558B7F B5A9D14F
On Thu, Oct 1, 2015 at 12:40 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
How did you set this limit? What did you write in your torrc file?
was
BandwidthBurst 18000000
BandwidthRate 18000000
TokenBucketRefillInterval 10
is now
BandwidthBurst 16500000
BandwidthRate 16500000
TokenBucketRefillInterval 10
On 1 Oct 2015, at 14:48, Dhalgren Tor dhalgren.tor@gmail.com wrote:
On Thu, Oct 1, 2015 at 12:40 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
How did you set this limit? What did you write in your torrc file?
...
is now
BandwidthBurst 16500000
BandwidthRate 16500000
TokenBucketRefillInterval 10
After three days the assigned bandwidth for the relay is going up instead of moderating--looks like the measurement system has a problem in this case. Just dropped the limit to 16500000 to stay below 100TB with maximum load.
A good number appears to be around 65000 to 70000, but 98000 was just assigned.
Since I don’t have your relay fingerprint, I don’t know where you got this figure from. Is it the consensus weight? The consensus weight is a unitless figure. It does not represent a number in bytes per second.
Is it the bandwidth used by the relay? If so, that is a bug.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B
teor at blah dot im OTR CAD08081 9755866D 89E2A06F E3558B7F B5A9D14F
On Thu, Oct 1, 2015 at 12:55 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
On 1 Oct 2015, at 14:48, Dhalgren Tor dhalgren.tor@gmail.com wrote:
A good number appears to be around 65000 to 70000, but 98000 was just assigned.
Since I don’t have your relay fingerprint, I don’t know where you got this figure from.
FP is Dhalgren A0F06C2FADF88D3A39AA3072B406F09D7095AC9E
Is it the consensus weight?
yes, this refers to the consensus weight.
The consensus weight is a unitless figure. It does not represent a number in bytes per second.
yes, know that; the 65-70k ideal number is based on what similar, properly loaded relays have. Relay was assigned 73k yesterday and was close to, but still above, a proper load.
If so, that is a bug.
Does seem like a bug.
On 1 Oct 2015, at 15:03, Dhalgren Tor dhalgren.tor@gmail.com wrote:
On Thu, Oct 1, 2015 at 12:55 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
On 1 Oct 2015, at 14:48, Dhalgren Tor dhalgren.tor@gmail.com wrote:
A good number appears to be around 65000 to 70000, but 98000 was just assigned.
Since I don’t have your relay fingerprint, I don’t know where you got this figure from.
FP is Dhalgren A0F06C2FADF88D3A39AA3072B406F09D7095AC9E
Is it the consensus weight?
yes, this refers to the consensus weight.
The consensus weight is a unitless figure. It does not represent a number in bytes per second.
yes, know that; the 65-70k ideal number is based on what similar, properly loaded relays have. Relay was assigned 73k yesterday and was close to, but still above, a proper load.
You can’t control your consensus weight, as it will change as the bandwidth authorities measure your relay, and as the rest of the network changes. Instead, configure your relay to only use the bandwidth you want it to use.
If so, that is a bug.
Does seem like a bug.
Can you help me understand what you think the bug is?
Is Tor using more bandwidth than the BandwidthRate? If so, this is a bug, and should be reported on the Tor Trac. Please check if it’s covered by https://trac.torproject.org/projects/tor/ticket/17170 "documentation for BandwidthRate etc should mention TCP/IP overhead not included"
If Tor is using less bandwidth than the BandwidthRate, but it’s still too high for you, please try setting the BandwidthRate at or slightly below what you want Tor to use.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B
teor at blah dot im OTR CAD08081 9755866D 89E2A06F E3558B7F B5A9D14F
On Thu, Oct 1, 2015 at 1:10 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
Can you help me understand what you think the bug is?
Relay is assigned a consensus weight that is too high w/r/t rate limit. Excess weight appears to be due to high quality of TCP/IP connectivity and low latency of relay. Result is overloaded relay and poor end-user latency.
Is Tor using more bandwidth than the BandwidthRate?
No, but relay is loaded to flat-line maximum and clearly is attracting too many circuits.
If so, this is a bug, and should be reported on the Tor Trac.
Not interested in doing that. Looking for a way to get it to work.
If the relay stays overloaded I'll try a packet-dropping IPTABLES rule to "dirty-up" the connection.
On 1 Oct 2015, at 15:22, Dhalgren Tor dhalgren.tor@gmail.com wrote:
On Thu, Oct 1, 2015 at 1:10 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
Can you help me understand what you think the bug is?
Relay is assigned a consensus weight that is too high w/r/t rate limit. Excess weight appears to be due to high quality of TCP/IP connectivity and low latency of relay. Result is overloaded relay and poor end-user latency.
Is Tor using more bandwidth than the BandwidthRate?
No, but relay is loaded to flat-line maximum and clearly is attracting too many circuits.
If so, this is a bug, and should be reported on the Tor Trac.
Not interested in doing that. Looking for a way to get it to work.
If the relay stays overloaded I'll try a packet-dropping IPTABLES rule to "dirty-up" the connection.
Please reduce your BandwidthRate until your relay load is what you want it to be, or wait until the bandwidth authorities notice your relay is overloaded and reduce its consensus weight.
Dropping packets using IPTABLES will actually increase your relay’s load due to retransmits, and degrade the performance of clients which use your relay. An IPTABLES rule would also degrade the performance of the overall Tor network due to these same retransmits.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B
teor at blah dot im OTR CAD08081 9755866D 89E2A06F E3558B7F B5A9D14F
On Thu, Oct 1, 2015 at 1:33 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
On 1 Oct 2015, at 15:22, Dhalgren Tor dhalgren.tor@gmail.com wrote:
If the relay stays overloaded I'll try a packet-dropping IPTABLES rule to "dirty-up" the connection.
Please reduce your BandwidthRate until your relay load is what you want it to be, or wait until the bandwidth authorities notice your relay is overloaded and reduce its consensus weight.
You are willfully missing the entire point here.
HAVE set the bandwidth limit and the measurement system is overrating the relay despite this.
Dropping packets using IPTABLES will actually increase your relay’s load due to retransmits, and degrade the performance of clients which use your relay. An IPTABLES rule would also degrade the performance of the overall Tor network due to these same retransmits.
I will give it a couple of days to normalize, but if the relay remains overrated I will proceed with an IPTABLES packet-dropping rate limit.
The point is to have the measurement system set a rating that gives an 80-90% load for active circuits. At that load no packets will be discarded for regular Tor users. The packet-dropping rate limit will only impact measurement connections.
You only mentioned the 100TB plan limit, which is why I suggested AccountingMax. I couldn't have guessed you were talking about some other policy limits.
The consensus weight is your bandwidth measured by the bandwidth authorities. This is used by clients to calculate your relay's probability of being chosen in a circuit. If it's big, yes of course it will attract more clients. If you rate-limit it in torrc, the bandwidth authorities won't be able to measure more than what you are offering. Maybe use this:
MaxAdvertisedBandwidth N bytes|KBytes|MBytes|GBytes|KBits|MBits|GBits
If set, we will not advertise more than this amount of bandwidth for our BandwidthRate. Server operators who want to reduce the number of clients who ask to build circuits through them (since this is proportional to advertised bandwidth rate) can thus reduce the CPU demands on their server without impacting network performance.
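A minimal torrc sketch of that suggestion, reusing the 16500000 figure already mentioned in this thread (whether the bandwidth authorities fully respect the advertised cap is precisely what is in dispute here):
BandwidthRate 16500000
BandwidthBurst 16500000
MaxAdvertisedBandwidth 16500000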
On 10/1/2015 4:22 PM, Dhalgren Tor wrote:
On Thu, Oct 1, 2015 at 1:10 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
Can you help me understand what you think the bug is?
Relay is assigned a consensus weight that is too high w/r/t rate limit. Excess weight appears to be due to high quality of TCP/IP connectivity and low latency of relay. Result is overloaded relay and poor end-user latency.
Is Tor using more bandwidth than the BandwidthRate?
No, but relay is loaded to flat-line maximum and clearly is attracting too many circuits.
If so, this is a bug, and should be reported on the Tor Trac.
Not interested in doing that. Looking for a way to get it to work.
If the relay stays overloaded I'll try a packet-dropping IPTABLES rule to "dirty-up" the connection.
Don't cap the speed if you have bandwidth limits. The better way to do it is using AccountingMax in torrc. Just let it run at its full speed less of the time and Tor will enter hibernation once it has no bandwidth left.
Not possible. Will violate the FUP (fair use policy) on the account.
Ouch, that's wrong. "BandwidthBurst" and "BandwidthRate" refer to bandwidth consumed by Tor as a client, e.g. your localhost SOCKS5. If you are trying to limit RELAYED traffic, as in sent and received by your relay functionality, you should use:
"RelayBandwidthRate" and "RelayBandwidthBurst", but these are an inferior solution to make your Tor relay not consume more than n TB per month. See my previous email, use "AccountingMax"; check out the manual linked in my previous email, it's highly customizable. Either give it 23 TBytes per week, or 96 TBytes per month... as you like.
On 10/1/2015 3:48 PM, Dhalgren Tor wrote:
On Thu, Oct 1, 2015 at 12:40 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
How did you set this limit? What did you write in your torrc file?
was
BandwidthBurst 18000000
BandwidthRate 18000000
TokenBucketRefillInterval 10
is now
BandwidthBurst 16500000
BandwidthRate 16500000
TokenBucketRefillInterval 10
On Thu, Oct 1, 2015 at 12:59 PM, s7r s7r@sky-ip.org wrote:
Ouch, that's wrong.
I have it correct. You are mistaken.
See https://www.torproject.org/docs/tor-manual.html.en
and read it closely.
Hello,
Don't cap the speed if you have bandwidth limits. The better way to do it is using AccountingMax in torrc. Just let it run at its full speed less of the time and Tor will enter hibernation once it has no bandwidth left.
Example: remove RelayBandwidthRate, RelayBandwidthBurst and TokenBucketRefillInterval or any other strange and unnecessary settings you might have for bandwidth limiting. Add the following:
AccountingMax 96 TBytes
AccountingStart month 1 00:00
Search for 'accounting' here: https://www.torproject.org/docs/tor-manual.html.en
Thanks for running an exit! More to come ;)
On 10/1/2015 3:33 PM, Dhalgren Tor wrote:
Have a new exit running in an excellent network on a very fast server with AES-NI. Server plan is limited to 100TB so have set a limit slightly above this (18000000 bytes/sec) thinking that bandwidth would run 80-90% of the maximum and average to just below the plan limit. After three days the assigned bandwidth for the relay is going up instead of moderating--looks like the measurement system has a problem in this case. Just dropped the limit to 16500000 to stay below 100TB with maximum load.
A good number appears to be around 65000 to 70000, but 98000 was just assigned.
What can be done to extract proper rating from measurement system? Tried setting TokenBucketRefillInterval to 10 milliseconds for more exact control but this has not helped. Should an IPTABLES packet-dropping limit be established? Can the rating system be fixed?
This relay appears to have the same problem:
sofia
https://atlas.torproject.org/#details/7BB160A8F54BD74F3DA5F2CE701E8772B84185...
On Thu, Oct 1, 2015 at 12:33 PM, Dhalgren Tor dhalgren.tor@gmail.com wrote:
Have a new exit running in an excellent network on a very fast server with AES-NI. Server plan is limited to 100TB so have set a limit slightly above this (18000000 bytes/sec) thinking that bandwidth would run 80-90% of the maximum and average to just below the plan limit. After three days the assigned bandwidth for the relay is going up instead of moderating--looks like the measurement system has a problem in this case. Just dropped the limit to 16500000 to stay below 100TB with maximum load.
What can be done to extract proper rating from measurement system? . . . Can the rating system be fixed?
On 10/01/2015 06:28 PM, Dhalgren Tor wrote:
This relay appears to have the same problem: sofia https://atlas.torproject.org/#details/7BB160A8F54BD74F3DA5F2CE701E8772B84185...
This is one of ours, and works just fine and the way it's supposed to?
Your 18000000 is quite near the 16.5 MByte/s it is currently pushing since you must have changed something on Sept 26/27, so I don't really see the issue. As said before in this thread, the consensus weight is a unitless value that is relative to the rest of the network and of no 'external significance'.
If my quick calculation isn't off, 18000000 gives you 42.4TB per direction, which means your relay will stay below the projected 100TB limit.
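(For reference: 18,000,000 bytes/sec x 86,400 sec x 30 days is about 46.7 TB, which is roughly 42.4 TiB if the quick calculation above was done in binary terabytes; that figure is per direction and before TCP/IP overhead.)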
How exactly do you determine that you see "too many connections"? Do you have any errors in the Tor log?
On Thu, Oct 1, 2015 at 5:12 PM, Moritz Bartl moritz@torservers.net wrote:
On 10/01/2015 06:28 PM, Dhalgren Tor wrote:
This relay appears to have the same problem: sofia https://atlas.torproject.org/#details/7BB160A8F54BD74F3DA5F2CE701E8772B84185...
This is one of ours, and works just fine and the way it's supposed to?
Certainly 'sofia' is working well enough, but it's clearly spending much of its time at or somewhat above the configured rate-limit in terms of load. This is sub-optimal for end-user latency because the relay delays traffic to enforce the rate limit. On this relay BandwidthBurst is unconfigured and perhaps setting it to the same value as BandwidthRate will cause the authorities to slightly lower the rating and eliminate the saturated state.
Your 18000000 is quite near the 16.5 MByte/s it is currently pushing since you must have changed something on Sept 26/27, so I don't really see the issue.
You are overlooking TCP/IP protocol bytes, which add between 5 and 13% to the data and are considered billable traffic by providers. At 18M it's solidly over 100TB; at 16.5M it will consume 97TB in 31 days.
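(The arithmetic behind that estimate, assuming roughly 10% overhead: 16,500,000 bytes/sec x 86,400 sec x 31 days is about 44.2 TB per direction, about 88.4 TB for both directions, and adding ~10% TCP/IP overhead comes to about 97 TB.)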
As said before in this thread, the consensus weight is a unitless value that is relative to the rest of the network and of no 'external significance'.
YES I understand this. Nowhere do I say I expect the consensus weight to correspond directly to BandwidthRate. What I SAID is that, based on comparative observation, the Dhalgren relay should be rated around 65000 to effect an approximate 90% utilization of the 18M limit. THIS is supposedly the intended design objective of the bandwidth allocation system.
If my quick calculation isn't off, 18000000 gives you 42.4TB per direction, which means your relay will stay below the projected 100TB limit.
Add TCP/IP overhead. I am looking at the service provider bandwidth consumption graph when determining the setting as well as including TCP/IP overhead in calculations.
How exactly do you determine that you see "too many connections"? Do you have any errors in the Tor log?
I determine this by
1) watching the service provider bandwidth graph
2) watching the output of "SETEVENTS BW" on a control channel and observing that every sample shows the relay is flat-line saturated at BandwidthRate.
3) observing that statistics show elevated cell-queuing delays when the relay has been in the saturated state, e.g.
cell-queued-cells 2.59,0.11,0.01,0.00,0.00,0.00,0.00,0.00,0.00,0.00
cell-time-in-queue 107,25,3,3,4,3,7,4,1,7
4) explicitly browsing through the relay utilizing "SETCONF ExitNodes=" and observing that latency is at minimum degraded and is sometimes terrible when the relay is overrated/saturated, while on the other hand latency is extraordinarily good when the relay is not in a saturated / rate-limited state.
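Point 2 refers to the Tor control protocol. A minimal way to watch those samples by hand, assuming ControlPort 9051 and password authentication are configured in torrc ("my-control-password" is a placeholder), is roughly:
telnet 127.0.0.1 9051
AUTHENTICATE "my-control-password"
SETEVENTS BW
After that the relay emits one "650 BW <bytes-read> <bytes-written>" event per second; every sample sitting pinned at the configured BandwidthRate is what "flat-line saturated" means above.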
On Thursday, October 1, 2015 3:05pm, "Dhalgren Tor" dhalgren.tor@gmail.com said: [snip]
You are overlooking TCP/IP protocol bytes which add between 5 and 13% to the data and are considered billable traffic by providers. At 18M it's solidly over 100TB, at 16.5M it will consume 97TB in 31 days.
Another consumer of bandwidth is name resolution, if this is an exit node. And the traffic incurred by the resolutions is not reflected in the relay statistics.
An exit node that allocates 100% of its bandwidth to relaying traffic will starve the resolver, and vice versa.
On Thu, Oct 1, 2015 at 7:45 PM, Steve Snyder swsnyder@snydernet.net wrote:
Another consumer of bandwidth is name resolution, if this is an exit node. And the traffic incurred by the resolutions is not reflected in the relay statistics.
An exit node that allocates 100% of its bandwidth to relaying traffic will starve the resolver, and vice versa.
Absolutely true where physical bandwidth is the limiting factor.
However please note this thread/topic is explicitly in regard to relays that have unconfined gigabit Ethernet connectivity in excellent capacity networks, but that must limit bandwidth consumption in order to avoid billing-plan overuse charges.
Loss of DNS resolver traffic is not a concern here.
In this specific case it appears that the Tor bandwidth allocation system "overrates" subject relays to the point where the relays will internally apply rate-limit throttling, thereby degrading end-user latency. This is not optimal and is undesirable.
As stated earlier in this thread, if the consensus weight of the relay in question does not moderate within a few days and unless a better idea materializes, an IPTABLES packet-dropping rate limit will be applied. This will cause the under-utilized gigabit connection to behave similarly to a heavily utilized lower-bandwidth connection. This should result in a lower authority rating that does not saturate the relay, while still making use of the intended amount of bandwidth.
On Thu, 1 Oct 2015 19:05:38 +0000 Dhalgren Tor dhalgren.tor@gmail.com wrote:
- observing that statistics show elevated cell-queuing delays when
the relay has been in the saturated state, e.g.
cell-queued-cells 2.59,0.11,0.01,0.00,0.00,0.00,0.00,0.00,0.00,0.00
cell-time-in-queue 107,25,3,3,4,3,7,4,1,7
So?
Using IP tables to drop packets also is going to add queuing delays since cwnd will get decreased in response to the loss (CUBIC uses beta of 0.2 IIRC).
It may be less queuing delay (note: write() completes the moment data is in the outgoing buffer kernel side, so it may not be as apparent, and is somewhat harder to measure), and it's your relay so I don't care what you do, so do whatever you think works.
That said, placing an emphasis on unit-less quantities generated by a measurement system that is currently held together by duct tape, string, and chewing gum seems rather pointless and counter productive.
Regards,
On Thu, Oct 1, 2015 at 10:17 PM, Yawning Angel yawning@schwanenlied.me wrote:
Using IP tables to drop packets also is going to add queuing delays since cwnd will get decreased in response to the loss (CUBIC uses beta of 0.2 IIRC).
Unfortunately true. Empirical arrival at a better result is the idea.
When saturated and rate-limiting, the relay sometimes is so bad that connections time out. Consistent though less-than-amazing performance is better than erratic sometimes-failing performance IMO. Tried a few exit relays that appear to be limited by 100 Mbit physical links and have consensus weights around 60k (i.e. TCP congestion control is at work though the load is not excessive) and they function much better.
It may be less queuing delay (note: write() completes the moment data is in the outgoing buffer kernel side, so it may not be as apparent, and is somewhat harder to measure), and it's your relay so I don't care what you do, so do whatever you think works.
That said, placing an emphasis on unit-less quantities generated by a measurement system that is currently held together by duct tape, string, and chewing gum seems rather pointless and counter productive.
Really?
The consensus weight has a precise and predictable effect on the amount of traffic directed to the relay. So gaming the measuring system for a weight that yields the best-possible user experience is not "pointless." I am paying for this.
Does seem the system generating the measurements has a problem, and if someone can look at this issue that would seem "productive."
Still interested in hearing "a better idea."
On 2 Oct 2015, at 01:19, Dhalgren Tor dhalgren.tor@gmail.com wrote:
On Thu, Oct 1, 2015 at 10:17 PM, Yawning Angel yawning@schwanenlied.me wrote:
Using IP tables to drop packets also is going to add queuing delays since cwnd will get decreased in response to the loss (CUBIC uses beta of 0.2 IIRC).
Unfortunately true. Empirical arrival at a better result is the idea.
When saturated and rate-limiting, the relay sometimes is so bad that connections time out. Consistent though less-than-amazing performance is better than erratic sometimes-failing performance IMO. Tried a few exit relays that appear to be limited by 100 Mbit physical links and have consensus weights around 60k (i.e. TCP congestion control is at work though the load is not excessive) and they function much better.
It may be less queuing delay (note: write() completes the moment data is in the outgoing buffer kernel side, so it may not be as apparent, and is somewhat harder to measure), and it's your relay so I don't care what you do, so do whatever you think works.
That said, placing an emphasis on unit-less quantities generated by a measurement system that is currently held together by duct tape, string, and chewing gum seems rather pointless and counter productive.
Really?
The consensus weight has a precise and predictable effect on the amount of traffic directed to the relay. So gaming the measuring system for a weight that yields the best-possible user experience is not "pointless." I am paying for this.
Does seem the system generating the measurements has a problem, and if someone can look at this issue that would seem "productive."
Still interested in hearing "a better idea."
We could modify the *Bandwidth* options to take TCP overhead into account. Alternately, we could modify the documentation to explicitly state that TCP overhead and name resolution on exits (and perhaps other overheads?) *aren’t* taken into account by those options. This would inform relay operators to take the TCP and DNS overheads for their particular setup into account when configuring the *Bandwidth* options, if the overhead is significant for them.
You suggested TCP overhead was 5%-13%; I can include that in the manual. Do we know what fraction of exit traffic is DNS requests? Are there any other overheads/additional traffic we should note while updating the manual? (Or would you suggest that we update the code? I’m not sure how much this actually helps, as, once deployed to all relays, the consensus weights for all relays that set *Bandwidth* options would come out slightly lower, and other relays without *Bandwidth* options set would take up the load.)
I’ve updated https://trac.torproject.org/projects/tor/ticket/17170 with the Exit DNS.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B
teor at blah dot im OTR CAD08081 9755866D 89E2A06F E3558B7F B5A9D14F
On Thu, Oct 1, 2015 at 11:41 PM, Tim Wilson-Brown - teor teor2345@gmail.com wrote:
We could modify the *Bandwidth* options to take TCP overhead into account.
Not practical. TCP/IP overhead varies greatly. I have a guard that averages 5% while the exit does 10% when saturated and more when running in good balance. Depends on the TCP packet payload sizes and header options, which can differ from connection to connection.
Alternately, we could modify the documentation to explicitly state that TCP overhead and name resolution on exits (and perhaps other overheads?) *aren’t* taken into account by those options. This would inform relay operators to take the TCP and DNS overheads for their particular setup into account when configuring the *Bandwidth* options, if the overhead is significant for them.
A good idea. Easy to overlook even for the experienced and a simple reminder goes a long way.
You suggested TCP overhead was 5%-13%, I can include that in the manual. Do we know what fraction of exit traffic is DNS requests?
Seems relatively minuscule though I haven't dug into it. Some resolver queries go over TCP while most are UDP. 'unbound' works so well I haven't spent much time with it. Have 'unbound' writing statistics to syslogd and if I see anything useful will append it to this thread. Some relay operators configure ISP DNS where the cache resides on another machine.
Are there any other overheads/additional traffic we should note while updating the manual?
Think that covers it. Operator 'ssh' is nothing; relay traffic is the 1000 lb gorilla.
(Or would you suggest that we update the code? I’m not sure how much this actually helps, as, once deployed to all relays, the consensus weights for all relays that set *Bandwidth* options would come out slightly lower, and other relays without *Bandwidth* options set would take up the load.)
Per above, impossible to figure. Better to remind folks when they refer to the documentation.
What I do is have a script that writes 'ifconfig ethX' to a file once a day, in sync with the service-provider accounting time zone. Have an 'awk' script that calculates bytes for each day--is very precise including all traffic and overheads. The script subtracts-out MAC headers (using packet counts) since the 14 bytes does not hit the WAN and is not billable.
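A rough sketch of that sort of accounting, assuming the newer net-tools 'ifconfig' output where counters appear on "RX packets N bytes M" / "TX packets N bytes M" lines (field positions differ on older distros); diffing successive daily snapshots then gives the per-day totals:
ifconfig eth0 | awk '
  /RX packets/ { rxp = $3; rxb = $5 }   # receive-side packet and byte counters
  /TX packets/ { txp = $3; txb = $5 }   # transmit-side packet and byte counters
  END { printf "in %.0f out %.0f\n", rxb - 14*rxp, txb - 14*txp }   # subtract 14-byte MAC headers
'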
Perhaps passing mention of 'ifconfig' statistics in the manual is worthwhile. 'ip -s link show ethX' truncates byte counters (on some distros anyway) and is useless.
"So" indeed. For the time that was under discussion:
cell-stats-end 2015-10-02 00:28:54 (86400 s)
cell-processed-cells 20220,420,72,18,8,4,1,1,1,1
cell-queued-cells 2.00,0.25,0.01,0.00,0.09,0.10,0.02,0.00,0.00,0.00
cell-time-in-queue 203,131,17,7,2832,6198,3014,802,21,26
cell-circuits-per-decile 126717
. . .horrible
On 10/1/15, Yawning Angel yawning@schwanenlied.me wrote:
On Thu, 1 Oct 2015 19:05:38 +0000 Dhalgren Tor dhalgren.tor@gmail.com wrote:
- observing that statistics show elevated cell-queuing delays when
the relay has been in the saturated state, e.g.
cell-queued-cells 2.59,0.11,0.01,0.00,0.00,0.00,0.00,0.00,0.00,0.00
cell-time-in-queue 107,25,3,3,4,3,7,4,1,7
So?
You're saying that you're on a 1Gbit/s link, but you are only allowed to use 100Mbit/s. Is this averaged over some timescale? If so, you could try and play around with the 'RelayBandwidthBurst' setting. Increasing the Burst might help reduce the queue delay when you're near saturation, assuming the traffic is not constant and you're not over-saturated most of the time.
I don't know the measuring system, but I doubt that random packet dropping with iptables will have a noticeable effect on the measured bandwidth, as long as you don't drop enough packets to horribly degrade user experience.
On 02.10.2015 at 10:16, Dhalgren Tor wrote:
"So" indeed. For the time that was under discussion:
cell-stats-end 2015-10-02 00:28:54 (86400 s)
cell-processed-cells 20220,420,72,18,8,4,1,1,1,1
cell-queued-cells 2.00,0.25,0.01,0.00,0.09,0.10,0.02,0.00,0.00,0.00
cell-time-in-queue 203,131,17,7,2832,6198,3014,802,21,26
cell-circuits-per-decile 126717
. . .horrible
On 10/1/15, Yawning Angel yawning@schwanenlied.me wrote:
On Thu, 1 Oct 2015 19:05:38 +0000 Dhalgren Tor dhalgren.tor@gmail.com wrote:
- observing that statistics show elevated cell-queuing delays when
the relay has been in the saturated state, e.g.
cell-queued-cells 2.59,0.11,0.01,0.00,0.00,0.00,0.00,0.00,0.00,0.00
cell-time-in-queue 107,25,3,3,4,3,7,4,1,7
So?
On 10/2/15, jensm1 jensm1@bbjh.de wrote:
You're saying that you're on a 1Gbit/s link, but you are only allowed to use 100Mbit/s. Is this averaged over some timescale?
More than 100 Mbit/s--that would only be about 60 TB/month total for both directions. The allowance is 100 TB/month, a common usage tier. It has a FUP (fair usage policy) attached, which means that the bandwidth should be consumed evenly over the accounting interval rather than in bursts.
If so, you could try and play around with the 'RelayBandwidthBurst' setting. Increasing the Burst might help reduce the queue delay when you're near saturation, assuming the traffic is not constant
Would make it worse, not better. A higher burst rate will allow the measurement to increase; the average limit must stay the same regardless. Best to set burst-max == average max. Tor relays allow some bursting regardless of the BandwidthRate setting anyway.
and you're not over-saturated most of the time.
This is the problem and the assumption is not correct. The link is saturated MOST of the time due to the overrating.
I don't know the measuring system, but I doubt that random packet dropping with iptables will have a noticeable effect on the measured bandwidth, as long as you don't drop enough packets to horribly degrade user experience.
NOT RANDOM. iptables will drop packets above the configured rate--exactly the same as a switch port dropping egress packets in excess of 100 MBit, a common lower-speed attachment type.
At present the bandwidth measurement system assesses bandwidth-constrained links that drop above-rate packets more correctly than unconstrained links where the relay daemon rate-limits via queuing and protocol flow-control.
To have the relay in question rated properly the best idea so far is to add an iptables rate limit to the mix. Will test keeping the BandwidthRate setting set slightly below the iptables limit in case that works better than an iptables limit alone.
Was going to wait a few days before reporting back, but early results are decisive.
The overload situation continued to worsen over a two-day period, with consensus weight continuing to rise despite the relay often running in a state of extreme overload and performing its exit function quite terribly.
Applied the following to the server, which has a single network interface:
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 match ip protocol 6 0xff police rate 163600kbit burst 6mb drop flowid :1
The above two lines put in effect a "traffic control" ingress policing filter that drops TCP protocol packets in excess of 163.6 Mbit/s. This is 13.6% above the original rate-limit target of 18Mbyte/sec of payload data. UDP traffic is not restricted so the rule will not impact most 'unbound' resolver responses nor 'ntpd' replies.
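(163600 kbit/s divided by 8 is about 20.45 MByte/s, and 20.45 / 18 is roughly 1.136, hence the 13.6% headroom over the 18 Mbyte/sec payload target.)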
Statistics for the filter are viewed with the command
tc -s filter show dev eth0 root
With the initial settings the bandwidth filter discarded 0.25% of incoming TCP packets--in line with what one sees in 'netstat -s' statistics for a not-overloaded relay. However the 'tor' daemon went straight back into the throttling state and, while improving slightly, continued to operate unacceptably.
Next BandwidthRate was set to parity with the 'tc' filter, or 20Mbyte/sec. Result was a dramatic increase in dropped packets to 1.5% and a dramatic improvement of the responsiveness of the relay. In fact the exit now runs great.
The bandwidth measurement system now looks like a secondary issue. The big problem is that Tor daemon bandwidth throttling sucks and should be avoided in favor of Linux kernel rate-limiting where actual bandwidth exceeds bandwidth allocated to Tor, whatever the motivation. TCP is much better at dealing with congestion than the relay internal limit-rate logic.
The above 'tc' filter is the simplest possible manifestation and works well for an interface dedicated to Tor. More nuanced rules can be crafted for situations where Tor traffic coexists with other types such that a limit will apply only to the Tor traffic, as sketched below.
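One way to scope the same policer, assuming for illustration that the relay is bound to a dedicated address (192.0.2.10 is a placeholder), is to add a destination-address match to the filter:
tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 match ip dst 192.0.2.10/32 match ip protocol 6 0xff police rate 163600kbit burst 6mb drop flowid :1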
-----
Note the applied 163.6 Mbit/s rate-limit is above the 150.0 Mbit/s rate that works out to exactly 100TB per month of data. The intent is to use approximately 96% of the 100TB quota and the 'tc' rate will be gradually fine-tuned to effect this outcome. BandwidthRate and BandwidthBurst are now reset to the 1Gbyte no-effect default value.
The 6mb buffer/burst chosen is an educated guess (very little searchable info on this) and equates to 50ms at 1gbps or 300ms at 163mbps. The number also corresponds with a bit more than what one might reasonably expect as the per-port egress buffer capacity of an enterprise-grade switch.
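(Checking that: 6 MBytes is about 48 Mbit, which drains in roughly 48 ms at 1 Gbit/s and roughly 290 ms at 163.6 Mbit/s, consistent with the 50 ms and 300 ms figures above.)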
On 3 Oct 2015, at 19:09, Dhalgren Tor dhalgren.tor@gmail.com wrote:
Was going to wait a few days before reporting back, but early results are decisive.
The overload situation continued to worsen over a two-day period, with consensus weight continuing to rise despite the relay often running in a state of extreme overload and performing its exit function quite terribly. ... The bandwidth measurement system now looks like a secondary issue. The big problem is that Tor daemon bandwidth throttling sucks and should be avoided in favor of Linux kernel rate-limiting where actual bandwidth exceeds bandwidth allocated to Tor, whatever the motivation. TCP is much better at dealing with congestion than the relay internal limit-rate logic.
…
You might find Rob Jansen's KIST paper interesting reading - it is about making Tor aware of kernel buffers. (Currently, Tor doesn’t know what happens after it writes data to the socket.)
http://www.robgjansen.com/publications/kist-sec2014.pdf
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP 968F094B
teor at blah dot im OTR CAD08081 9755866D 89E2A06F E3558B7F B5A9D14F