Hi all :)
Here are the initial notes and the first meeting minutes for the Onion Services SRE work, which is part of the Sponsor 123 project.
# Onion Services Site Reliability Engineering - Kickstart
## About
The Onion Services SRE work focuses on procedures for the automated setup and maintenance of high availability Onion Services sites.
### Objectives and key results (OKRs)
Objective: this project is part of the "Increase the adoption of Onion Services" priority for 2022 (https://miro.com/app/board/uXjVOQz6oZg=/), for which we can select the following goals:
0. Easy to set up and maintain. How to measure this?
1. Sane defaults. How to measure this?
2. Configurable and extensible. How to measure this?
## Initial plan
0. Meeting with dgoulet, hiro and anarcat to get advice on kickstarting the project: what/where to look for specs, tools, goals, security checklists, limits etc. (meeting minutes below).
1. Research on all relevant deployment technologies: build a first matrix.
2. Then meet with the media organizations: inventory, compliance checks etc.
3. Build the second matrix (use cases).
## Kickstart meeting agenda
### Dimensions
Split discussion in two dimensions:
0. What are the possible architectures to build an Onion balanced service?
1. What are the available stacks/tools to implement those architectures?
### Initial considerations
While brainstorming about this project, the following considerations were sketched:
0. Software suite: the Sponsor 123 project includes provisioning/monitoring onion services as deliverables, but the effort could be used to create a generic product (a "suite") which would include an Onionbalance deployer.
1. Key generation: such a suite could generate all .onion keys locally (on the sysadmin's box), encrypting them and pushing them to a private repository (at least the frontend onion keypair for each site/service). Other sysadmins could then clone that internal repository and start managing the sites/machines (see the keypair sketch after this list).
2. Disposability: depending on design choices, the frontend .onion address could be the only persistent data, and everything else could be disposable/recycled/recreated in case of failure or a major infrastructure/design revamp.
That of course depends on whether Onionbalance supports backend rotation.
Consequence: the initial effort could focus on a good frontend implementation (Onionbalance instance etc.), while backends and other nodes could be reworked later if time is limited right now.
3. Elasticity: that leads to the following requirement: should this system allow backend nodes to be added and removed at will, so that it can become elastic in the future, adding and removing nodes according to average load or sysadmin command? Or would that require restarting the Onionbalance instance (resulting in unwanted downtime), or even break it?
Currently Onionbalance supports only up to 8 backends: https://gitlab.torproject.org/tpo/core/onionbalance/-/issues/7
The initial proposal for the Sponsor 123 project would be to use a fixed number of 8 backends per .onion site, but experiments could be made to test whether a site could have a dynamic number of backends (see the config sketch after this list).
4. Uniformity with flexibility: it looks like most (if not all) sites can have the same "CDN" fronting setup, while their "last mile"/endpoints might all be different. That said, the "first half" of the solution could be based on the same software suite and workflow, which could be flexible enough to accept distinct endpoint configurations.
5. External instance(s): for the Sponsor 123 contract, a single instance of this "CDN" solution could be used to manage all sites, instead of having to manage many instances (and dashboards) in parallel.
Future contracts with other third parties could either be managed using that same instance or have their own instances (isolation).
6. Internal instance: another, internal instance could be set up to manage all sites listed at https://onion.torproject.org if TPA likes and decides to adopt the solution :)
7. Migration support: the previous point would depend on built-in support for migrating existing onion services into the CDN instance.
8. Other considerations: see rhatto's skill-test research.
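On the key generation point (item 1 above), here is a minimal sketch of local v3 .onion keypair generation, assuming Python 3 with the `cryptography` package; the address derivation follows rend-spec-v3. The function name is illustrative, not part of any agreed design, and the encrypt-and-push-to-repository step is left out:

```python
# Minimal sketch: generate an ed25519 identity keypair on the sysadmin's box
# and derive the corresponding v3 .onion address (per rend-spec-v3).
# Requires the "cryptography" package; names here are illustrative only.
import base64
import hashlib

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def generate_onion_identity():
    """Return a fresh private key and its v3 .onion address."""
    private_key = Ed25519PrivateKey.generate()
    public = private_key.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw
    )
    # address = base32(pubkey || checksum[:2] || version), with version = 3
    version = b"\x03"
    checksum = hashlib.sha3_256(b".onion checksum" + public + version).digest()[:2]
    address = base64.b32encode(public + checksum + version).decode().lower()
    return private_key, address + ".onion"


if __name__ == "__main__":
    key, onion = generate_onion_identity()
    print(onion)
```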
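And on the elasticity point (item 3), a hypothetical sketch of how the suite could render an Onionbalance v3 `config.yaml` for a frontend with up to 8 backends; the YAML layout follows the upstream documentation, but all paths and addresses are placeholders. Adding or removing a backend would then just mean re-rendering the config and reloading Onionbalance; whether that works without downtime is exactly the open question above:

```python
# Hypothetical config renderer: build an Onionbalance v3 config.yaml for one
# frontend service and a fixed list of backend addresses. Requires PyYAML.
# The key path and backend addresses below are placeholders, not real values.
import yaml

MAX_BACKENDS = 8  # current Onionbalance upper bound (see issue 7 above)


def render_config(frontend_key_path, backend_addresses):
    """Return the YAML text of an Onionbalance config for one service."""
    if len(backend_addresses) > MAX_BACKENDS:
        raise ValueError(f"Onionbalance supports at most {MAX_BACKENDS} backends")
    return yaml.safe_dump({
        "services": [{
            "key": frontend_key_path,
            "instances": [
                {"name": f"node{i}", "address": addr}
                for i, addr in enumerate(backend_addresses, start=1)
            ],
        }]
    })


if __name__ == "__main__":
    print(render_config("frontend.key", ["<backend-1>.onion", "<backend-2>.onion"]))
```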
### Questions
General:
0. If you were the Onion Services SRE, how would you implement this project?
1. Which existing solutions should we look at, and which should we avoid?
2. What limits should we expect from the current technology, and how could we work around them?
Architecture:
0. What do people think about the architecture proposed by rhatto during his skill test (setting aside the improvised implementation he coded)?
1. The Tor daemon is a single process, with no threads. How does it scale under load for Onion Services, and with a varying number of Onion Services?
2. Which other limits are important to consider within the scope of this project, like the current upper bound of 8 Onionbalance backend servers?
Implementation:
0. What are the dimensions for the comparison matrix of existing DevOps solutions such as Puppet, Ansible, Terraform and Salt (and specific modules/recipes/cookbooks/roles)?
1. Should this suite be tested using Chutney or via the Shadow simulator (GitLab CI)? Does that make sense?
2. Which other tests should be considered?
3. How does TPA manage passphrases and secrets for existing systems and keys?
4. Which TPA (or other) security policies, if any, should be observed in this project?
5. Which solutions are in use to manage the sites listed at https://onion.torproject.org/?
Management:
0. The Sponsor 123 Project Plan timeline predicts the setup of the first .onion sites in M1 and M2, with 2-5 business days to set up a single .onion site. But coding a solution could take longer. How to proceed? The suggested approach is to have a detailed discovery phase while coding the initial solution in parallel. Some rework might be needed, but we can gain time overall.
## Possible next tasks
0. Gather all relevant docs on onion services.
1. Build a comprehensive Onion Service checklist/documentation, including stuff like:
    * Basic:
        * Relay security checklist (if one exists).
        * Best practices and references: see existing and legacy docs like https://gitlab.torproject.org/legacy/trac/-/wikis/doc/OperationalSecurity
        * Making sure the system clock is synchronized.
        * Setting up the Onion-Location header (see the sketch after this list).
        * Encrypted backup of .onion keys.
    * Optional:
        * Vanity address generation (using `mkp224o`)?
        * Setting up HTTPS with valid x509 certificates (and automatic HTTP -> HTTPS connection upgrade).
        * Setting up Onion Names (HTTPS Everywhere patch, or whatever has taken its place).
        * Onion v3 auth (currently unsupported by Onionbalance, see https://gitlab.torproject.org/tpo/core/onionbalance/-/issues/5).
2. Create repository (suggested name: Oniongroove - gives groove to Onionbalance!).
3. Write an initial spec proposal for the system after both matrices are ready and other requirements are defined, covering architecture, implementation and UX.
4. Write a threat model for the current implementation, considering issues such as the lack of support for offline Onion Services keys, which makes the protection of the frontend keys a critical issue.
5. Create tickets for these and other tasks.
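For the Onion-Location item in the checklist above, here is a small sketch of an automated check that a deployed site actually advertises its onion counterpart. It uses only the standard library, and the URL in the example is just a placeholder:

```python
# Minimal check: does a clearnet site send the Onion-Location header?
# Standard library only; the example URL is a placeholder.
import urllib.request
from typing import Optional


def onion_location(url: str) -> Optional[str]:
    """Return the Onion-Location header value, or None if the site lacks it."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Onion-Location")


if __name__ == "__main__":
    print(onion_location("https://example.org/"))  # placeholder URL
```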
## Meeting Minutes - 2022-02-08
### Participants
* Anarcat
* David
* Hiro
* Rhatto
### Discussion
(Free-form note taking; doesn't necessarily/precisely represent what people said)
Rhatto:
* Short intro, summarizing stuff above.
Hiro:
* When talking last year with NYT: something that would help on the community side: someone doing a course (bootcamp) on DevOps, applying stuff like Terraform, Ansible etc.
* What would be easier for rhatto to do (script, Ansible)?
Anarcat:
* Puppet/agent:
    * TPA is a big Puppet shop, but don't think it's the right tool for the job: too central, like having a central Puppet server.
    * Also, Ansible is more popular.
    * Ansible has few requirements, making it easier to deploy and to reuse.
    * Not sure about Terraform; there were issues provisioning to Hetzner or Ganeti.
Rhatto:
* Maybe something to deploy node instances, and on top of that use stuff like Ansible to provision the services?
* How does TPA provision nodes at Hetzner and Ganeti?
* Shall we look at Kubernetes?
Anarcat:
* Before joining Tor: it was kind of a mess of shell scripts.
* Wrote a kind of Debian installer with Python + Fabric.
* The installer configures a machine up to the point where it can be added to LDAP/Puppet.
* Maybe an MVP that uses Ansible (service setup) and then another using Terraform (node setup).
Hiro:
* Docker Swarm using Terraform.
* Likes Ansible (because Python + SSH are the only requirements).
* About Kubernetes: same issue as with Puppet: you have to run a centralized set of control nodes.
* Ansible: lots of recipes available to harden the machine.
* Puppet is complicated, I think, because it works for your own infrastructure.
* It works for companies because it is tailored to providers.
David:
* There are lots of recipes and blog posts about Ansible for Tor.
Anarcat:
* Docker: does provide some standard environment.
* Like what rhatto did in his skill test.
* Question with Docker: what to use? Swarm, Kubernetes, Compose? The irony with Docker is that it's not obvious how to use it in production.
* Docker might be interesting for us to produce Docker containers.
* Part of the job is to do that evaluation.
Rhatto:
* Could do all this research.
Anarcat:
* About stopping using NGINX: having trouble with the blog, with upstream charging a lot for the traffic.
* NGINX: generic webserver, had heard lots of good things about it.
* Set up 2 VMs caching the blog, but then retired them, as the caching didn't work out.
* NGINX is open core, especially tricky when you want to do monitoring.
* OpenResty is very interesting.
Hiro:
* OpenResty: similar open core model to NGINX.
Rhatto:
* How to connect the solution and the endpoint.
* Questions:
    * Is local .onion keypair generation a good approach?
    * Could offline .onion key support be on the roadmap?
    * Are backend keys disposable?
David:
* Offline keys: very unlikely to have them until the Rust rewrite. Would not bet on that coming in time for this project.
* Local key generation and deployment: there will be a need for this.
* Would not bet that we could rotate the Onionbalance keys.
### Next
See "Possible next tasks" section above :)