0. Summary
As part of my end-of-month deliverables, I've been trying to figure out how Tor fits together, so as to identify the most important areas to test, the hardest areas to test, and the best ways to modularize Tor in the future. Here's my current draft. Some of these ideas are half-baked; many will need more rounds of revision before they can be acted on.
For more information see tickets #15236 and #15234.
1. Testing Tor: Where we should focus
1.1. Let's focus on complicated, volatile things.
These functions rank highest in length and complexity:
circuit_expire_building, parse_port_config, run_scheduled_events, connection_ap_handshake_rewrite_and_attach, options_act, networkstatus_parse_vote_from_string, connection_dir_client_reached_eof, directory_handle_command_get, networkstatus_compute_consensus, options_validate
If we instead focus on the number of commits that still have live code in each function, we get:
circuit_expire_building, parse_port_config, run_scheduled_events, connection_ap_handshake_rewrite_and_attach, options_act, networkstatus_parse_vote_from_string, connection_dir_client_reached_eof, directory_handle_command_get, networkstatus_compute_consensus, options_validate
And if we focus on the average number of commits per line of code in each function, weighted in various ways, these stand out:
add_stream_log_impl, connection_close_immediate, connection_connect_sockaddr, connection_dir_client_reached_eof, connection_edge_destroy, connection_exit_begin_conn, connection_handle_write_impl, consider_testing_reachability, directory_info_has_arrived, init_keys, options_act, options_validate, router_add_to_routerlist, router_dump_router_to_string, tor_free_all, router_rebuild_descriptor, run_scheduled_events
If we focus on how many changes have *ever* been made to a function, the top offenders are:
router_rebuild_descriptor, onion_extend_cpath, directory_handle_command, dns_resolve, assert_connection_ok, connection_handle_write, connection_handle_listener_read, dns_found_answer, circuit_extend, main, connection_edge_process_inbuf, connection_ap_handshake_attach_circuit, config_assign, init_keys, circuit_send_next_onion_skin, connection_dir_process_inbuf, prepare_for_poll, connection_exit_begin_conn, connection_ap_handshake_process_socks, router_dump_router_to_string, fetch_from_buf_socks, getconfig, run_scheduled_events, connection_edge_process_relay_cell, do_main_loop
No single list above is uniquely the right one to focus on; instead, we should probably use them in tandem to identify areas to work on.
1.2. What will be hardest to test?
The entire "blob" described in section 2.1 below should be considered hard to test, because of the number of functions reachable from one another. Removing some of this callgraph complexity will make testing easier, but doing so won't be trivial. See Appendix A for a list of functions of this kind.
2. Untangling Tor: Modules and You
If you've taken a software engineering class, you were probably taught how to diagram the modules or layers of a software project. I tried this with Tor back in 2005 and my head spun; things have not gotten easier since then.
Looking at Tor's call graphs tells us a lot about our issues with modularizing Tor. In the abstract, we'd like to divide Tor into layers and modules, with most modules calling only a few others, and with the "central" functions limited to as little code as possible. This would make modules more independent and help us test them in isolation. It would also support future design ideas, like splitting Tor into separate processes for testing, security, or robustness purposes.
2.1. Fun with callgraphs
I did some analysis of Tor's callgraph to figure out where we stand today. About 27% of the functions that Tor calls (internally and externally) can reach 10 or fewer other functions that Tor knows about. About 82% can reach only a small portion of Tor.
Then there comes a block of functions, each of which can reach nearly every other function in the block; they are all reachable from one another, forming one big strongly connected component of the callgraph. There are about 450 of these, or 11% of the functions in Tor. I'll call them "The Blob."
Finally, there are functions that can reach the Blob, and all the functions that the Blob calls, but which are not themselves part of the Blob. These are about 100 functions, or 2% of the functions in Tor.
Breaking up the Blob is a high priority for making Tor more modular.
So, how can we do that?
* Let's start by looking at which functions, if they were made unnecessary, or if they stopped calling into the Blob, would remove the most other functions from the Blob.
* Let's also start by hand-annotating the Blob with a list of functions that logically shouldn't be part of the Blob, and then seeing what they call that pulls them into the Blob.
Here are some preliminary results for the best functions to try to remove from the Blob, and some antipatterns to avoid:
* The following functions pull many, many other functions into the Blob: control_event_* router_rebuild_descriptor_ directory_fetches_from_authorities *mark_for_close* *free *free_
If we could stop them from calling back into the Blob, we would shrink the Blob to 254 functions, and reduce the number of outside functions calling into it to 74.
Also worth investigating is whether to remove the call from 'connection_write_to_buf_impl_' to 'connection_handle_write', and the call from 'connection_handle_write_impl' to 'connection_handle_read'. Either of these changes would sharply reduce the number of functions reachable from Blob functions, and would shift many functions from Blob members to Blob callers.
Finally, these might be more work to unblob: 'connection_or_change_state', 'tor_cleanup', 'init_keys', and 'connection_or_close_for_error'. But doing so would take the Blob down to 60 functions, with 118 Blob callers.
* Many functions are meant simply to note that an event has occurred, so that some work must happen later. We could refactor most of these to use a deferred callback instead of invoking the functions that do the work directly. This would also help us decouple Blob functions from one another.
* I have the Blob functions listed in Appendix A.1, with the ones that would still be in the Blob after the changes above listed in Appendix A.2.
2.2. Separating modules into processes and libraries; module-level APIs.
Let's imagine this architecture: Tor runs as a series of conceptually separate processes that communicate over a reasonable IPC mechanism. These processes correspond to module boundaries; their APIs need to be carefully designed. There is a central communications process that does the work of transferring commands and replies from A to B.
APIs would be divided into publish/subscribe types and remote-procedure call types. Modules would be divided into stateful/non-stateful; non-stateful ones could be shared libraries; or if their processing is sensitive, they could be kept in very sandboxed processes.
How do we get there?
To make the proposal concrete, I suggest for a 0th-order draft that we imagine the following simplified version:
* That all src/common modules become shared libraries with no state.
* That we divide (or imagine how to divide) the contents of src/or into separate modules, with more clear connections between them. We imagine doing this one part at a time, starting by adding a nice simple publish/subscribe mechanism.
* That the controller API be reimagined as a layer on top of the rest of this.
* That any multiprocess version of this be imagined, for concreteness, as Protocol Buffers over nanomsg. (I'm not tied to this particular format/transport combination; I'm just proposing it for concreteness.)
We would need a transition plan here. I'd suggest that some of the easiest modules to extract first would be voting, rephist, node information management, path selection, directory data, and hidden services. Most of these have the virtue of being more easily testable once they are extracted.
Once we have some significant pieces split into libraries and submodules, we could start doing process-level refactoring, moving towards an architecture built out of processes and parts of these kinds:
* Network: Has libevent-driven main loop. Accepts commands in a pool, maybe from a subthread.
* Utility: Is just event-driven. Answers requests, gives responses.
* Manager: starts up processes as needed and connects them to one another.
* Dispatcher: relays messages from where they come from to where they go.
Our best next steps here, IMO, are to identify one or two key areas and abstractions, and to begin isolating them in practice, so that we get a feel for how this kind of transition works and can iterate on the design.
[Appendices in attachment so as to keep this to a reasonable size.]