[tor-dev] QUIC TOR Debugging Question (no attach)

Xiaofan Li xli2 at andrew.cmu.edu
Fri Apr 29 06:56:42 UTC 2016


Tim:
Sorry for not being specific enough on my questions. I'll try to give more
detailed questions later instead of higher-level problems.

Regarding the frequency of my emails, I apologize for the long intervals.
I'm not working on this project full-time, and I often have exams, so I can
only work on the QUIC TOR project a couple of days every week. Fortunately,
I'm now nearly done with all my finals for this semester and can put more
time into this project from now on.

Right now, I have two specific questions:

1. We just switched from chutney to testing on EmuLab (a testing framework
where each node is a standalone machine). After the switch, a particular
bug we saw on chutney disappeared: on chutney, some nodes used to crash
mysteriously with no log output (the logs simply stop, with no stack trace
or anything). This bug only occurred when there was an existing cache (the
first run after chutney configure was fine). After porting to EmuLab, using
an almost identical torrc file, the bug disappeared and everything runs
fine for now, so we are ignoring it. Have you seen similar issues on
chutney?

2. The circuit building process is taking too long and many circuits
expire. We have 4 relays, 2 of which are also authorities. In the logs,
I'm seeing a lot of the following lines:

   - circuit_expire_building(): Abandoning circ XX XXXX:XX:12345 (state
   0,0: doing handshakes, purpose 5, len 3)

   - router_choose_random_node(): We found 3 running nodes.
   router_choose_random_node(): We removed 1 excludednodes, leaving 2 nodes.
   router_choose_random_node(): We removed 2 excludedsmartlist, leaving 0
   nodes.

The first line appears when we have connected to the first node and are
waiting for a response from the second or sometimes the third relay. The
second group of lines appears when we are trying to choose the path for a
circuit. *What could I do to increase the number of available nodes? Should
I increase the frequency of reachability tests?*

After looking at the code, *there's a comment at circuitbuild.c line 2172
describing why some nodes are excluded, which I don't quite understand*.
Specifically, the comment says: "XXXX025 use the using_as_guard flag to
accomplish this." Where can I find more information on this XXXX025 issue
(committed here
<https://lists.torproject.org/pipermail/tor-commits/2013-March/053977.html>)?
*Why are these routers being excluded?*


Please let me know if you want more specific information on those issues.
Thank you!
Li.

On Sun, Apr 24, 2016 at 11:33 PM, Tim Wilson-Brown - teor <
teor2345 at gmail.com> wrote:

>
> > On 25 Apr 2016, at 06:44, Xiaofan Li <xli2 at andrew.cmu.edu> wrote:
> >
> > Hi Tim and everyone on tor-dev,
> >
> > Our QUIC + TOR project has almost been fully implemented. We are
> > debugging the last few bugs. Update:
> >       • We're now able to build many complete circuits with QUIC as
> > the underlying protocol.
> >       • We have not debugged the actual communication part yet. We are
> > aware of certain failure cases for QUIC (e.g. line 15642 of the log is
> > being debugged right now). So we cannot send actual client data yet.
> >       • The current state uses QUIC for OR connections only. Thus a
> > dual path is implemented, as suggested in my last email thread.
> >       • TLS is completely bypassed, and important state (that is set up
> > in tls_handshake functions) is preserved and refactored out, e.g.
> > conn->/chan->state purpose, etc.
> >       • Some tinkering and re-designing of QUIC itself is also underway.
> > The fact that QUIC is a transport protocol at the application layer
> > makes it painful to interact with the event and timer systems of TOR.
> > We are trying to improve this aspect now.
> > The debugging log I was attaching was too big for the tor-dev list. So
> > if you are interested in taking a look at the file, let me know.
>
> Large debug logs contain too much information to be helpful to you or to
> us.
>
> Try warning, notice, or info level logs, in that order.
> Using high-level logs makes it easier to work out where your attempts to
> send data have broken down.
> Once you've identified where communication has broken down, try to fix it.
>
> If you can't fix it, you're welcome to ask for advice.
> Please quote a small number of relevant log messages, tell us what you
> think they mean, and what you've tried to do to fix it.
> Also feel free to provide a link to logs at that level for people to look
> through.
>
> This makes it more likely that people will recognise your issue and
> respond by helping you to fix it.
>
> > Some particularly concerning things in the log:
> >       • circuit_get_by_circid_channel_impl(): found nothing for circ_id
> > 14801, channel ID 2 (0x7f758bb6b740)
> > Then it just attaches this circ onto this channel. Is this normal?
> >       • Line 4901 circuit_receive_relay_cell(): Passing on unrecognized
> > cell.
> > It happens a lot. Is this normal?
> >       • This sequence happened a lot around 7500.
> > relay_send_command_from_edge_(): delivering 10 cell forward.
> > circuit_package_relay_cell(): crypting a layer of the relay cell.
> > circuit_package_relay_cell(): crypting a layer of the relay cell.
> > circuit_package_relay_cell(): crypting a layer of the relay cell.
> > It seems like it's decrypting and forwarding cells along. Is it normal
> > for TOR (with TCP) to do this in a burst? Because I'm seeing about 1s
> > of repeated calls.
>
> I honestly don't think these are concerning at all. But I don't really
> know.
> And I can't find out, because I don't know which version of tor you've
> based your changes on.
>
> Here's how you can find out whether these log messages are typical or not:
>
> Run the original version of Tor that you've based your QUIC changes on,
> with the same network configuration.
> (Does it work? If not, your QUIC network will likely never work either.)
> Then compare the warning, notice, and info logs to tor with QUIC.
> Stop at the first log that differs in non-trivial ways.
> This is a log level that's useful for you.
> (High-level logs will also cause you less concern about spurious messages.)
>
> This way, you can answer your own questions about which logs and
> behaviours are normal, and which ones you've introduced.
> Feel free to report back with any log messages from the unmodified version
> of Tor that might indicate bugs.
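
The comparison suggested above can be partly automated. Below is a minimal
Python sketch, not a tor tool. It assumes log lines shaped like
"Apr 29 06:56:42.123 [info] func(): message"; the helper names normalize
and new_templates are made up for this example. It strips timestamps,
replaces numbers and pointers with placeholders, and lists the message
shapes that appear only in the QUIC build's log.

```python
import re

# Assumed line shape: "Apr 29 06:56:42.123 [info] func(): message".
_PREFIX = re.compile(r"^\w{3} +\d+ \d\d:\d\d:\d\d\.\d+ ")
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_NUM = re.compile(r"\d+")


def normalize(line):
    """Reduce a log line to a template: no timestamp, no concrete numbers."""
    line = _PREFIX.sub("", line.strip())
    line = _HEX.sub("<hex>", line)   # pointers such as 0x7f758bb6b740
    return _NUM.sub("<n>", line)     # circuit IDs, counts, ports, ...


def new_templates(baseline_lines, modified_lines):
    """Return templates seen in the modified log but never in the baseline,
    in order of first appearance."""
    seen = {normalize(line) for line in baseline_lines}
    result = []
    for line in modified_lines:
        template = normalize(line)
        if template and template not in seen:
            seen.add(template)
            result.append(template)
    return result
```

To use it, read one log from the unmodified tor and one from the QUIC
build (same log level, same network configuration), pass the two lists of
lines to new_templates, and start reading from the first entry it returns.
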
>
> > Some more general questions:
> >       • Internal circuits: any docs? What are they used for? Measuring
> > bandwidth?
>
> Relay bandwidth testing, relay reachability testing (default chutney
> configs skip this using AssumeReachable), client directory fetches, hidden
> service directory document uploads, onion services (hidden services), …
>
> Read the ~12 instances of CIRCLAUNCH_IS_INTERNAL in the tor source code
> for more details.
>
> > How many internal circuits are required by the system?
>
> As many as are necessary to support the operation of the Tor client /
> relay / onion service at the current time.
> Initially, 2 or 3 (read circuit_predict_and_launch_new for more details).
>
> >       • Circuit wide ID format. We had a bug regarding this last week.
> > process_create_cell always fails at the check on lines 281-295 of
> > command.c (the check for CIRC_ID_TYPE and id_is_high). For now we have
> > commented out this check. What does that affect? And is it safe to do
> > this?
>
> I can't see how this could be your client communication issue. It's only
> an issue if the circuit IDs collide, which should be unlikely in small
> networks.
>
> When two relays create circuits on a connection, one uses the lower half
> of the circuit id space, and one uses the upper half. This prevents circuit
> IDs colliding. Read the definitions of circ_id_type, circ_id_type_t, and
> channel_set_circid_type for details.
>
> The version of the link protocol determines how this decision is made.
> I assume that your tor has chan->conn->link_proto >=
> MIN_LINK_PROTO_FOR_WIDE_CIRC_IDS.
> (You can check this by printing out the value of chan->conn->link_proto
> everywhere channel_set_circid_type is called.)
>
> So you've removed TLS client identity and TLS server identity keys.
> What do get_tlsclient_identity_key and get_server_identity_key return?
> Null bytes?
>
> Is there a publicly known key in QUIC that's known by both sides and
> stable for the life of a connection?
> If so, use that.
>
> If not, always pass 0 for consider_identity to channel_set_circid_type, so
> that the initiator uses the upper half of the circuit IDs, regardless of
> keys.
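
The split of the circuit ID space described above can be modelled in a few
lines. This is a simplified Python sketch, not tor's code: the names
CircIdType, circid_type_ignoring_identity and create_cell_id_ok are
invented here, and the bit positions (bit 31 for 4-byte wide IDs, bit 15
for 2-byte IDs) are my reading of the id_is_high check, so treat them as
assumptions to verify against command.c.

```python
from enum import Enum


class CircIdType(Enum):
    LOWER = "lower"    # we allocate circuit IDs with the top bit clear
    HIGHER = "higher"  # we allocate circuit IDs with the top bit set


def high_bit(wide_circ_ids=True):
    """Top bit of the ID: bit 31 for 4-byte IDs, bit 15 for 2-byte IDs."""
    return 1 << 31 if wide_circ_ids else 1 << 15


def circid_type_ignoring_identity(started_here):
    """Model of the consider_identity == 0 case: the side that opened the
    connection takes the upper half, the other side the lower half."""
    return CircIdType.HIGHER if started_here else CircIdType.LOWER


def create_cell_id_ok(circ_id, our_type, wide_circ_ids=True):
    """A CREATE cell from the peer must use the half we do NOT allocate
    from; otherwise the two ends could pick colliding IDs."""
    id_is_high = bool(circ_id & high_bit(wide_circ_ids))
    if id_is_high and our_type is CircIdType.HIGHER:
        return False
    if not id_is_high and our_type is CircIdType.LOWER:
        return False
    return True
```

Under this model, the check rejects every CREATE cell when both ends of a
connection end up with the same type, which is one thing worth ruling out
before disabling the check.
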
>
> Breaking other parts of the circuit management code could also cause this
> issue.
>
> >       • From a high level, when a client sends data using a circuit,
> > what is its code path? Which special (as in, specific to
> > client-initiated communication) functions are called?
>
> I'm not sure how to answer this question. The unhelpful but accurate
> answer is "not many codepaths are client-specific, if there are any at all".
>
> Regardless of its role in the network, every tor instance performs common
> operations like retrieving consensus documents and building circuits. And,
> if configured to do so, tor instances can perform multiple roles.
>
> Here are some high-level differences between client and server
> communication in the tor network that could be causing your issues:
>
> Typically, clients, onion services, and bridges retrieve directory
> documents using "begindir", a TLS connection to the ORPort. Relays and
> authorities do this unencrypted over the DirPort. If you haven't replaced
> TLS with QUIC correctly, clients may fail to bootstrap or retrieve
> directory documents. There should be log messages about this.
>
> Clients have a SOCKSPort open, and in response to application requests
> they make an AP (application) connection that's linked to a stream on a
> circuit that's been extended to the destination exit relay. They then send
> requests received on the SOCKSPort to the destination relay, and receive
> responses that they forward to the application. (The onion service setup is
> slightly more complex, but transmits data in a similar way.)
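
To exercise the client path described above without a browser, a small
SOCKS5 probe is enough. The following Python sketch uses only the standard
library and follows the SOCKS5 message layout from RFC 1928. The address
127.0.0.1:9050 is tor's default SOCKSPort and is an assumption here (test
networks usually use other ports); the function names are made up for this
example.

```python
import socket
import struct


def socks5_connect_request(host, port):
    """Build a SOCKS5 CONNECT request that carries a hostname, so that
    name resolution is left to the proxy."""
    name = host.encode("ascii")
    if not 0 < len(name) < 256:
        raise ValueError("hostname must be 1..255 bytes")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name)
    return (b"\x05\x01\x00\x03" + bytes([len(name)]) + name
            + struct.pack(">H", port))


def _recv_exact(sock, count):
    """Read exactly count bytes or raise if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed by SOCKS proxy")
        data += chunk
    return data


def probe(host, port, socks_addr=("127.0.0.1", 9050), timeout=60.0):
    """Ask the SOCKS proxy to connect to host:port and return the SOCKS5
    reply code (0 means the connection was established)."""
    with socket.create_connection(socks_addr, timeout=timeout) as sock:
        sock.sendall(b"\x05\x01\x00")          # offer "no authentication"
        if _recv_exact(sock, 2) != b"\x05\x00":
            raise ConnectionError("SOCKS5 greeting was rejected")
        sock.sendall(socks5_connect_request(host, port))
        reply = _recv_exact(sock, 2)           # VER, REP
        return reply[1]
```

Calling probe with a destination reachable from the test network, then
reading the client's notice and info logs from that moment onward, gives
one small, repeatable event to trace through the relays.
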
>
> Have you read torguts?
> https://gitweb.torproject.org/user/nickm/torguts.git/
>
> Any part of this process could break and cause client communication to
> fail.
> Parts of the relay code could also break in ways that cause client
> communication to fail.
>
> I can't see how to describe specific code paths without more specific (and
> precise) detail about what's failing, and whether it's failing on clients
> or relays. You can find this in the logs, if you log sensibly. Let us know
> what you find, and what you tried to do to fix it.
>
> What high-level success or failure message (warning, notice, info) is
> logged on the client right after you try to make an application connection?
> Does the connection reach a relay? The exit? DNS? The remote site?
> What warning, notice, or info-level message is logged on the last tor node
> where the connection stops working?
> (Or what DNS or HTTP request is sent to the remote server / site?)
>
> > Any other comment on the log is greatly appreciated, since everyone here
> > is probably more familiar than me with what a normal bootstrapping
> > process would look like.
>
> Don't worry too much about the log messages. They're designed to be used
> for debugging once there is a known issue.
> The vast majority are harmless, and many need context to interpret. You
> can find this context by searching the tor code for unique words or phrases
> in the log message. (But keep in mind that log strings are often composed
> of shorter strings.)
>
> Some general requests for future questions:
>
> It would be much easier and faster for me (and perhaps others) to help you
> if you asked questions after trying to identify and fix issues yourself. I
> encourage you to try some of the things I've suggested, and ask more
> precise questions next time.
>
> Personally, I would find it easier to respond to targeted questions that
> come one at a time, every few days or every week, rather than a large email
> every few weeks.
>
> It might also be helpful to be able to see the source code you're working
> on, rather than trying to guess what changes you've made from what I
> remember of what you said about your design in previous emails.
>
> Tim
>
> Tim Wilson-Brown (teor)
>
> teor2345 at gmail dot com
> PGP 968F094B
> ricochet:ekmygaiu4rzgsk6n
>
>
>
>