Summary: start upgrading servers during the Debian 13 ("trixie") freeze. If that goes well, complete most of the fleet upgrade around June 2025 and the rest by the end of 2025, leaving 2026 entirely free of major upgrades. Improve automation and retire old container images.
Deadline: 2 weeks, 2025-04-01
# Background
Debian 13 ("trixie"), currently "testing", is going into freeze soon, which means we should have a new Debian stable release in 2025. It has been a long-standing tradition at TPA to collaborate in the Debian development process and part of that process is to upgrade our servers during the freeze. Upgrading during the freeze makes it easier for us to fix bugs as we find them and contribute them to the community.
The [freeze dates announced by the debian.org release team][] are:

- 2025-03-15 - Milestone 1 - Transition and toolchain freeze
- 2025-04-15 - Milestone 2 - Soft Freeze
- 2025-05-15 - Milestone 3 - Hard Freeze - for key packages and packages without autopkgtests
- To be announced - Milestone 4 - Full Freeze

We have entered the "transition and toolchain freeze", which locks changes to packages like compilers and interpreters unless an exception is granted. See the [Debian freeze policy][] for an explanation of each step.
Even though we've just completed the Debian 11 ("bullseye") and 12 ("bookworm") upgrades in late 2024, we feel it's a good idea to start *and* complete the Debian 13 upgrades in 2025. That way, we can hope to have a year or two (2026-2027?) *without* any major upgrades.
This proposal is part of the [Debian 13 trixie upgrade milestone][], itself part of the [2025 TPA roadmap][].

[freeze dates announced by the debian.org release team]: https://lists.debian.org/debian-devel-announce/2025/01/msg00004.html
[Debian freeze policy]: https://release.debian.org/testing/freeze_policy.html
[Debian 13 trixie upgrade milestone]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/12
[2025 TPA roadmap]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/roadmap/2025

# Proposal
As usual, we perform the upgrades in three batches, in increasing order of complexity, starting in 2025Q2, hoping to finish by the end of 2025.
Note that this year's proposal also covers the Tails infrastructure. To help merge rotations between the two teams, TPA staff will upgrade Tails machines with assistance from Tails folks, and vice versa.
## Affected users
All service admins are affected by this change. If you have shell access on any TPA server, you want to read this announcement.
In the past, TPA has typically kept a page detailing notable changes, and a proposal like this one would link to the upstream release notes. Unfortunately, at the time of writing, upstream hasn't yet produced release notes (as we're still in testing).
We're hoping the documentation will be refined by the time we're ready to coordinate the second batch of updates, around May 2025, when we will send reminders to affected teams.
We do expect the Debian 13 upgrade to be less disruptive than bookworm, mainly because Python 2 is already retired.
## Notable changes
For now, here are some known changes that are already in Debian 13:

| Package            | 12 (bookworm) | 13 (trixie) |
|--------------------|---------------|-------------|
| Ansible            | 7.7           | 11.2        |
| Apache             | 2.4.62        | 2.4.63      |
| Bash               | 5.2.15        | 5.2.37      |
| Emacs              | 28.2          | 30.1        |
| Fish               | 3.6           | 4.0         |
| Git                | 2.39          | 2.45        |
| GCC                | 12.2          | 14.2        |
| Golang             | 1.19          | 1.24        |
| Linux kernel image | 6.1 series    | 6.12 series |
| LLVM               | 14            | 19          |
| MariaDB            | 10.11         | 11.4        |
| Nginx              | 1.22          | 1.26        |
| OpenJDK            | 17            | 21          |
| OpenLDAP           | 2.5.13        | 2.6.9       |
| OpenSSL            | 3.0           | 3.4         |
| PHP                | 8.2           | 8.4         |
| Podman             | 4.3           | 5.4         |
| PostgreSQL         | 15            | 17          |
| Prometheus         | 2.42          | 2.53        |
| Puppet             | 7             | 8           |
| Python             | 3.11          | 3.13        |
| Rustc              | 1.63          | 1.85        |
| Vim                | 9.0           | 9.1         |

Most of those, except the toolchains (e.g. LLVM/GCC), can still change, as we're not in the full freeze yet.
## Upgrade schedule
The upgrade is split into multiple batches:
- automation and installer changes
- low complexity: mostly TPA services and less critical Tails servers
- moderate complexity: TPA "service admins" machines and remaining Tails physical servers and VMs running services from the official Debian repositories only
- high complexity: Tails VMs running services not from the official Debian repositories
- cleanup
The free time between the first two batches will also allow us to cover for unplanned contingencies: upgrades that could drag on and other work that will inevitably need to be performed.
The objective is to do the batches in collective "upgrade parties" that should be "fun" for the team. This policy has proven effective in previous upgrades and we are eager to repeat it.
### Upgrade automation and installer changes
First, we tweak the installers to deploy Debian 13 by default, to avoid installing further "old" systems. This includes the bare-metal installers but also, and especially, the virtual machine installers and container images.
Concretely, we're planning on changing the `latest` container image tag to point to `trixie` in early April. A full *year* later, the `bookworm` container images will be retired. Note that we are already planning the retirement of the "old stable" (`bullseye`) container images, see [tpo/tpa/base-images#19][], for which you may have already been contacted.
New `idle` canary servers will be set up with Debian 13 to test integration with the rest of the infrastructure, and future new machine installs will be done with Debian 13.
We also want to work on automating the upgrade procedure further. We've had catastrophic errors in the PostgreSQL upgrade procedure in the past, in particular, but the whole procedure is now considered ripe for automation, see [tpo/tpa/team#41485][] for details.

[tpo/tpa/base-images#19]: https://gitlab.torproject.org/tpo/tpa/base-images/-/issues/19
[tpo/tpa/team#41485]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41485

### Batch 1: low complexity
This is scheduled over two weeks: TPA boxes will be upgraded in the last week of April, and Tails in the first week of May.
The idea is to start the upgrade long enough before the vacations to give us plenty of time to recover, and some room to start the second batch.
In April, Debian should also be in "soft freeze", not quite a fully "stable" environment, but that should be good enough for simple setups.
35 TPA machines:
```
archive-01.torproject.org
cdn-backend-sunet-02.torproject.org
chives.torproject.org
dal-rescue-01.torproject.org
dal-rescue-02.torproject.org
gayi.torproject.org
hetzner-hel1-02.torproject.org
hetzner-hel1-03.torproject.org
hetzner-nbg1-01.torproject.org
hetzner-nbg1-02.torproject.org
idle-dal-02.torproject.org
idle-fsn-01.torproject.org
lists-01.torproject.org
loghost01.torproject.org
mandos-01.torproject.org
media-01.torproject.org
minio-01.torproject.org
mta-dal-01.torproject.org
mx-dal-01.torproject.org
neriniflorum.torproject.org
ns3.torproject.org
ns5.torproject.org
palmeri.torproject.org
perdulce.torproject.org
srs-dal-01.torproject.org
ssh-dal-01.torproject.org
static-gitlab-shim.torproject.org
staticiforme.torproject.org
static-master-fsn.torproject.org
submit-01.torproject.org
vault-01.torproject.org
web-dal-07.torproject.org
web-dal-08.torproject.org
web-fsn-01.torproject.org
web-fsn-02.torproject.org
```
4 Tails machines:
```
ecours.tails.net
puppet.lizard
skink.tails.net
stone.tails.net
```
In the [first batch of bookworm machines][], we ended up taking 20 minutes per machine and finished in a single day, but we noted that the second batch took longer.
It's probably safe to estimate 20 hours (30 minutes per machine) for this work, in a single week.
Feedback and coordination of this batch happens in [issue batch 1][].

[first batch of bookworm machines]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41251
[issue batch 1]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42071

### Batch 2: moderate complexity
This is scheduled for the last week of May for TPA machines, and the first week of June for Tails.
At this point, Debian testing should be in "hard freeze", which should be more stable.
40 TPA machines:
```
anonticket-01.torproject.org
backup-storage-01.torproject.org
bacula-director-01.torproject.org
btcpayserver-02.torproject.org
bungei.torproject.org
carinatum.torproject.org
check-01.torproject.org
ci-runner-x86-02.torproject.org
ci-runner-x86-03.torproject.org
colchicifolium.torproject.org
collector-02.torproject.org
crm-int-01.torproject.org
dangerzone-01.torproject.org
donate-01.torproject.org
donate-review-01.torproject.org
forum-01.torproject.org
gitlab-02.torproject.org
henryi.torproject.org
materculae.torproject.org
meronense.torproject.org
metricsdb-01.torproject.org
metricsdb-02.torproject.org
metrics-store-01.torproject.org
onionbalance-02.torproject.org
onionoo-backend-03.torproject.org
polyanthum.torproject.org
probetelemetry-01.torproject.org
rdsys-frontend-01.torproject.org
rdsys-test-01.torproject.org
relay-01.torproject.org
rude.torproject.org
survey-01.torproject.org
tbb-nightlies-master.torproject.org
tb-build-02.torproject.org
tb-build-03.torproject.org
tb-build-06.torproject.org
tb-pkgstage-01.torproject.org
tb-tester-01.torproject.org
telegram-bot-01.torproject.org
weather-01.torproject.org
```
17 Tails machines:
```
apt-proxy.lizard
apt.lizard
bitcoin.lizard
bittorrent.lizard
bridge.lizard
dns.lizard
dragon.tails.net
gitlab-runner.iguana
iguana.tails.net
lizard.tails.net
mail.lizard
misc.lizard
puppet-git.lizard
rsync.lizard
teels.tails.net
whisperback.lizard
www.lizard
```
The [second batch of bookworm upgrades][] took 33 hours for 31 machines, so about one hour per box. Here we have 57 machines, so it will likely take us 60 hours (or two weeks) to complete the upgrade.
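As a rough cross-check of the batch estimates, the arithmetic can be sketched in a few lines of Python. The machine counts and per-machine durations come from this proposal; the `batch_hours` helper and the variable names are purely illustrative and not part of any TPA tooling.

```python
def batch_hours(machines: int, minutes_per_machine: float) -> float:
    """Hypothetical helper: total hours for a batch at a flat per-machine rate."""
    return machines * minutes_per_machine / 60


# Batch 1: 35 TPA + 4 Tails machines at the assumed 30 minutes each.
batch1 = batch_hours(35 + 4, 30)

# Batch 2: 40 TPA + 17 Tails machines, scaled from the bookworm
# second batch (33 hours for 31 machines).
bookworm_rate_minutes = 33 / 31 * 60
batch2 = batch_hours(40 + 17, bookworm_rate_minutes)

print(f"batch 1: {batch1:.1f}h, batch 2: {batch2:.1f}h")
```

Both results land close to the rounded 20 and 60 hour figures used in this proposal.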
Feedback and coordination of this batch happens in [issue batch 2][].

[second batch of bookworm upgrades]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41252
[issue batch 2]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42070

### Batch 3: high complexity
Those machines are harder to upgrade, or more critical. In the case of TPA machines, we typically group here the Ganeti servers and all the "snowflake" servers that are not properly Puppetized and are full of legacy, namely the LDAP, DNS, and Puppet servers.
That said, we waited a long time to upgrade the Ganeti cluster for bookworm, and it turned out to be trivial, so perhaps those could eventually be made part of the second batch.
15 TPA machines:
```
- [ ] alberti.torproject.org
- [ ] dal-node-01.torproject.org
- [ ] dal-node-02.torproject.org
- [ ] dal-node-03.torproject.org
- [ ] fsn-node-01.torproject.org
- [ ] fsn-node-02.torproject.org
- [ ] fsn-node-03.torproject.org
- [ ] fsn-node-04.torproject.org
- [ ] fsn-node-05.torproject.org
- [ ] fsn-node-06.torproject.org
- [ ] fsn-node-07.torproject.org
- [ ] fsn-node-08.torproject.org
- [ ] nevii.torproject.org
- [ ] pauli.torproject.org
- [ ] puppetdb-01.torproject.org
```
It seems like the [bookworm Ganeti upgrade][] took roughly 10h of work. We ballpark the rest of the upgrade to another 10h of work, so possibly 20h.
11 Tails machines:
```
- [ ] isoworker1.dragon
- [ ] isoworker2.dragon
- [ ] isoworker3.dragon
- [ ] isoworker4.dragon
- [ ] isoworker5.dragon
- [ ] isoworker6.iguana
- [ ] isoworker7.iguana
- [ ] isoworker8.iguana
- [ ] jenkins.dragon
- [ ] survey.lizard
- [ ] translate.lizard
```
The challenge with Tails upgrades is the coordination with the Tails team, in particular for the Jenkins upgrades.
Feedback and coordination of this batch happens in [issue batch 3][].

[bookworm Ganeti upgrade]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41254
[issue batch 3]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42069

### Cleanup work
Once the upgrade is completed and the entire fleet is again running a single OS, it's time for cleanup. This involves updating configuration files to the new versions and removing old compatibility code in Puppet, removing old container images, and generally wrapping things up.
Historically, this process has been neglected, but we hope to finish it in 2026 at the latest.
## Timeline

- 2025-Q2
  - W14 (first week of April): default container image changed to `trixie`, installer defaults changed and first tests in production
  - W18 (last week of April): Batch 1 upgrades, TPA machines
  - W19 (first week of May): Batch 1 upgrades, Tails machines
  - W22 (last week of May): Batch 2 upgrades, TPA machines
  - W23 (first week of June): Batch 2 upgrades, Tails machines
- 2025-Q3 to Q4: Batch 3 upgrades
- 2026-Q2: bookworm container image retired

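
To translate the week numbers in the timeline into calendar dates, a short sketch like the following can help. It assumes the "W14" style labels are ISO 8601 week numbers for 2025, which this proposal does not state explicitly; the `week_start` helper and the `MILESTONES` mapping are illustrative names only.

```python
from datetime import date

# Assumption: the W-numbers in the timeline are ISO 8601 weeks of 2025.
MILESTONES = {
    14: "default container image and installer defaults switch to trixie",
    18: "Batch 1 upgrades, TPA machines",
    19: "Batch 1 upgrades, Tails machines",
    22: "Batch 2 upgrades, TPA machines",
    23: "Batch 2 upgrades, Tails machines",
}


def week_start(year: int, week: int) -> date:
    """Return the Monday of the given ISO week (hypothetical helper)."""
    return date.fromisocalendar(year, week, 1)


for week, label in MILESTONES.items():
    print(f"W{week} starts {week_start(2025, week).isoformat()}: {label}")
```

Under that assumption, the weeks start on Mondays that match the "last week of April" and "first week of June" descriptions in the timeline.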
## Deadline
The community has until the beginning of the above timeline to raise concerns or objections.
Two weeks before performing the upgrades of each batch, a new announcement will be sent with details of the changes and impacted services.
# Alternatives considered
## Retirements or rebuilds
We do not plan any major upgrades or retirements in the third phase this time.
In the future, we hope to decouple those as much as possible, as the Icinga retirement and the Mailman 3 upgrade became blockers that slowed down the bookworm upgrade significantly. In both cases, however, the upgrades *were* challenging and had to be performed one way or another, so it's unclear if we can optimize this any further.
We are clear, however, that we will not postpone an upgrade for a server retirement. Dangerzone, for example, is scheduled for retirement ([TPA-RFC-78][]) but is still planned as normal above.
[TPA-RFC-78]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-78-dangerz...
# Costs

| Task              | Estimate | Certainty | Worst case |
|-------------------|----------|-----------|------------|
| Automation        | 20h      | extreme   | 100h       |
| Installer changes | 4h       | low       | 4.4h       |
| Batch 1           | 20h      | low       | 22h        |
| Batch 2           | 60h      | medium    | 90h        |
| Batch 3           | 20h      | high      | 40h        |
| Cleanup           | 20h      | medium    | 30h        |
| **Total**         | 144h     | ~high     | ~286h      |

The entire work here should take a bit over 140 hours, or 18 days, or about 4 weeks full time. The worst case doubles that.
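
The worst-case column in the hours table can be reproduced with a small sketch. The multipliers below (1.1, 1.5, 2 and 5) are the ones implied by the table itself, for example 20h at "extreme" certainty becoming 100h; the names `MULTIPLIER`, `TASKS` and `worst_case` are illustrative assumptions, not existing tooling.

```python
# Uncertainty multipliers as implied by the table (assumption).
MULTIPLIER = {"low": 1.1, "medium": 1.5, "high": 2.0, "extreme": 5.0}

# (task, estimate in hours, certainty), copied from the hours table.
TASKS = [
    ("Automation", 20, "extreme"),
    ("Installer changes", 4, "low"),
    ("Batch 1", 20, "low"),
    ("Batch 2", 60, "medium"),
    ("Batch 3", 20, "high"),
    ("Cleanup", 20, "medium"),
]


def worst_case(estimate: float, certainty: str) -> float:
    """Scale an estimate by the multiplier for its certainty level."""
    return estimate * MULTIPLIER[certainty]


total = sum(hours for _, hours, _ in TASKS)
worst = sum(worst_case(hours, certainty) for _, hours, certainty in TASKS)

print(f"expected: {total}h (~{total / 8:.0f} days), worst case: {worst:.1f}h")
```

This only re-derives the totals in the table; it says nothing about how the estimates or certainty levels themselves were chosen.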
The above is done in "hours" because that's how we estimated batches in the past, but here's an estimate that's based on the [Kaplan-Moss estimation technique][].
[Kaplan-Moss estimation technique]: https://jacobian.org/2021/may/25/my-estimation-technique/

| Task              | Estimate | Certainty | Worst case |
|-------------------|----------|-----------|------------|
| Automation        | 3d       | extreme   | 15d        |
| Installer changes | 1d       | low       | 1.1d       |
| Batch 1           | 3d       | low       | 3.3d       |
| Batch 2           | 10d      | medium    | 20d        |
| Batch 3           | 3d       | high      | 6d         |
| Cleanup           | 3d       | medium    | 4.5d       |
| **Total**         | 23d      | ~high     | ~50d       |

This is *roughly* equivalent, if a little higher (23 days instead of 18).
It should be noted that automation is not expected to drastically reduce the total time spent in batches (currently 16 days or 100 hours). The main goal of automation is rather to reduce the likelihood of catastrophic errors and to make it easier to share our upgrade procedure with the world. We're still hoping to reduce the time spent in batches by 10-20%, which would bring the total across batches from 16 days to 14 days, or from 100 hours to 80 hours.
# Approvals required
This proposal needs approval from TPA team members, but service admins can request an additional delay if they are worried about their service being affected by the upgrade.
Comments or feedback can be provided in issues linked above, or the general process can be commented on in issue [tpo/tpa/team#41990][].
# References

* [Debian 13 trixie upgrade milestone][]
* [discussion ticket][tpo/tpa/team#41990]

[TPA bookworm upgrade procedure]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/upgrades/bookworm
[tpo/tpa/team#41990]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41990