Hi everyone,
TPA held its first meeting of the year, and these are the minutes. I'll take the opportunity to wish everyone a happy new year, if you're into that kind of calendar. I know it's not the most obvious thing to do right now, but I hope you can find hope this year.
# Roll call: who's there and emergencies
* anarcat
* kez
* lavamind
No emergencies.
# Holidays debrief
Holidays went fine, some minor issues, but nothing that needed to be urgently dealt with (e.g. [40569][], [40567][], [commit][], runner bug). Rotation worked well.
[40569]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40569
[40567]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40567
[commit]: https://gitweb.torproject.org/admin/tsa-misc.git/commit/?id=33fb8ded635aa620...
anarcat went cowboy and set up two new nodes before the holidays, which is not great because it goes against our general "don't launch on a Friday" rule. (It wasn't on a Friday, but it was close enough to the holidays to be a significant risk.) Thankfully things worked out fine: one of the runners ended up failing just as lavamind was starting work again last week. (!)
# 2021 roadmap review
## sysadmin
We did a review directly in the wiki page. Notable changes:
* jenkins is marked as completed, as rouyi will be retired this week (!)
* the blog migration was completed!
* we consider we managed to deal with the day-to-day while still reserving time for the unexpected (e.g. the rushed web migration from Jenkins to GitLab CI)
* we loved that team work and should plan to do it again
* we were mostly on budget: we had an extra 100EUR/mth at hetzner for a new Ganeti node in the gnt-fsn cluster, and extra costs (54EUR/mth!) for the [Hetzner IPv4 billing changes][], and more for extra bandwidth use
[Hetzner Ipv4 billing changes]: https://docs.hetzner.com/general/others/ipv4-pricing/
## web
Did a review of the 2021 web roadmap (from the [wiki homepage][]), copied below:
[wiki homepage]: https://gitlab.torproject.org/tpo/web/team/-/wikis/home
* [ ] [Donations page redesign][] - 10-50%
* [ ] [Improves bridges.torproject.org][] - 80% done!
* [ ] [Remove outdated documentation from the header][] - the "docs.tpo ticket", considering using dev.tpo instead, focus on launching dev.tpo next instead
* [x] Migrate blog.torproject.org from Drupal to Lektor: it needs a milestone and planning
* [x] [Support forum][]
* [ ] [Developer portal][] AKA dev.tpo
* [x] Get website build from Jenkins into GitLab CI for the static mirror pool (before December)
* [ ] Get up to speed on maintenance tasks:
  * [ ] Bootstrap upgrade - uh god.
  * [ ] browser documentation update - what is this?
  * [ ] get translation stats available - what is this?
  * [x] rename 'master' branch as 'main'
  * [ ] fix wiki for documentation - what is this?
  * [ ] get [onion service tooling][] into TPO GitLab namespace - what is this?
[Donations page redesign]: https://gitlab.torproject.org/groups/tpo/-/milestones/22
[Improves bridges.torproject.org]: https://gitlab.torproject.org/groups/tpo/-/milestones/7
[Remove outdated documentation from the header]: https://gitlab.torproject.org/tpo/web/team/-/issues/8
[Support forum]: https://gitlab.torproject.org/groups/tpo/-/milestones/26
[Developer portal]: https://gitlab.torproject.org/groups/tpo/-/milestones/23
[onion service tooling]: https://gitlab.torproject.org/hiro/roid
# Sysadmin+web OKRs for 2022 Q1
We want to take more time to plan for the web team in particular, so we focused especially on this in the meeting.
## web team
We did the following brainstorm. Anarcat will come up with a proposal for a better-formatted OKR set for next week, at which point we'll prioritize this and the sysadmin OKRs for Q1.
* OKR: rewrite of the donate page ([milestone 22][])
  * [new lektor frontend][]
  * [we can donate through the .onion][]
  * [vanilla JS rewrite][]
* OKR: make it easier for translators to contribute
  * help the translation team to switch to Weblate
  * it is easier for translators to find their built copy of the website
  * bring build time to 15 minutes to accelerate feedback to translators
  * allow the web team to trigger manual builds for reviews
* OKR: documentation overhaul:
  * [launch dev.tpo][]
  * "Remove outdated documentation from the header", stop pointing to dead docs
  * come up with ideas on how to manage the wiki situation
  * cleanup the queues and workflow
* OKR: resurrect bridge port scan
  * do not scan private IP blocks
  * make it pretty
[milestone 22]: https://gitlab.torproject.org/groups/tpo/-/milestones/22
[new lektor frontend]: https://gitlab.torproject.org/tpo/web/donate-static/-/issues/37
[we can donate through the .onion]: https://gitlab.torproject.org/tpo/web/donate-static/-/issues/36
[vanilla JS rewrite]: https://gitlab.torproject.org/tpo/web/donate-static/-/issues/45
[launch dev.tpo]: https://gitlab.torproject.org/tpo/web/dev/-/issues/6
Missed from the last meeting:
* sponsor 9 stuff: collected UX feedback for portals, which involves web to fix issues we found, need to prioritise
We also need to organise with the new people:
* onion SRE: new OTF project USAGM, starting in February
* new community person
# Other discussions
# Next meeting
We're going to hold another meeting next week, same time, to review the web OKRs and prioritize Q1.
# Metrics of the month
* hosts in Puppet: 89, LDAP: 91, Prometheus exporters: 139
* number of Apache servers monitored: 27, hits per second: 185
* number of Nginx servers: 0, hits per second: 0, hit ratio: 0.00
* number of self-hosted nameservers: 6, mail servers: 8
* pending upgrades: 7, reboots: 0
* average load: 0.35, memory available: 4.01 TiB/5.13 TiB, running processes: 643
* disk free/total: 84.95 TiB/39.99 TiB
* bytes sent: 325.45 MB/s, received: 190.66 MB/s
* planned bullseye upgrades completion date: 2024-09-07
* [GitLab tickets][]: 159 tickets including...
  * open: 2
  * icebox: 143
  * backlog: 8
  * next: 2
  * doing: 2
  * needs information: 2
  * (closed: 2573)
[Gitlab tickets]: https://gitlab.torproject.org/tpo/tpa/team/-/boards
Upgrade prediction graph now lives at:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/upgrades/bullseye/
... with somewhat accurate values, although the 2024 estimate above should be taken with a grain of salt, as we haven't really started the upgrade at all.
# Number of the month
5. We just hit the 5TiB of deployed memory, kind of neat.
# Another number of the month
0. We have zero Nginx servers left, as we turned off our two Nginx servers (ignoring the Nginx server in the GitLab instance, which is not really monitored correctly) when we migrated the blog to a static site. Those two servers were the caching servers sitting in front of the Drupal blog for cost savings. They served us well but are now retired since they are not necessary for the static version.
An error crept into the metrics of this month and last; see if you can spot it:
On 2022-01-11 20:34:08, Antoine Beaupré wrote:
# Metrics of the month
- hosts in Puppet: 89, LDAP: 91, Prometheus exporters: 139
- number of Apache servers monitored: 27, hits per second: 185
- number of Nginx servers: 0, hits per second: 0, hit ratio: 0.00
- number of self-hosted nameservers: 6, mail servers: 8
- pending upgrades: 7, reboots: 0
- average load: 0.35, memory available: 4.01 TiB/5.13 TiB, running processes: 643
- disk free/total: 84.95 TiB/39.99 TiB
- bytes sent: 325.45 MB/s, received: 190.66 MB/s
- planned bullseye upgrades completion date: 2024-09-07
- [GitLab tickets][]: 159 tickets including...
- open: 2
- icebox: 143
- backlog: 8
- next: 2
- doing: 2
- needs information: 2
- (closed: 2573)
hint: it's about disk space...
anyone?
Credit to roger, who figured it out: the disk free/total was backwards. The correct figure should have read:
* disk free/total: 38.28 TiB/84.95 TiB
... in this report. Future reports shouldn't have this error. It should also be noted that those metrics should generally be taken with a grain of salt. The disk query was introduced recently and, in particular, counts disk usage of the (huge) backup server (60TiB), which itself keeps a copy of everything by definition.
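A swapped free/total pair like this one is easy to catch mechanically, since free space can never exceed total space. The following is only a minimal sketch of such a check; the function name `format_disk_metric` is hypothetical and is not part of the actual report tooling.

```python
def format_disk_metric(free_tib: float, total_tib: float) -> str:
    """Format the disk line of the report, refusing impossible values.

    Hypothetical helper for illustration only: free space larger than
    total space almost certainly means the two values were swapped.
    """
    if free_tib > total_tib:
        raise ValueError(
            f"free ({free_tib} TiB) exceeds total ({total_tib} TiB): values swapped?"
        )
    return f"disk free/total: {free_tib:.2f} TiB/{total_tib:.2f} TiB"


# The corrected figures from this report pass the check.
print(format_disk_metric(38.28, 84.95))
```

Calling it with the figures as originally published (84.95 as free, 39.99 as total) raises a `ValueError` instead of producing the line.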
The network metrics also probably overcount things as we simply do this:
sum(rate(node_network_transmit_bytes_total[30d]))
... which may count traffic twice, in case you are unfamiliar with Prometheus and our network infrastructure (the most likely case). This will count internal traffic between network mirrors, for example.
I haven't yet figured out a good (AKA simple) way to fix those queries...
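To illustrate the overcounting, here is a rough sketch (not a fix) that models a plain sum over every series next to a sum that skips some devices. The host names, device names, and rates are invented for the example, and the choice of devices to exclude is an assumption rather than a description of the real hosts.

```python
import re

# Hypothetical 30-day average transmit rates in MB/s, keyed by
# (instance, device). These numbers are invented, not real measurements.
SAMPLES = {
    ("host-a", "eth0"): 120.0,
    ("host-a", "lo"): 5.0,
    ("host-a", "tap0"): 40.0,
    ("host-b", "eth0"): 60.0,
}


def total_rate(samples, exclude_devices=None):
    """Sum rates across all series, skipping devices that match the pattern.

    With exclude_devices=None this behaves like a plain sum over every
    series. The pattern must match the whole device name, as regex label
    matchers do in PromQL.
    """
    pattern = re.compile(exclude_devices) if exclude_devices else None
    return sum(
        rate
        for (_instance, device), rate in samples.items()
        if pattern is None or not pattern.fullmatch(device)
    )


naive = total_rate(SAMPLES)                                   # every series
filtered = total_rate(SAMPLES, exclude_devices=r"lo|tap.*")   # skip some devices
print(f"naive: {naive} MB/s, filtered: {filtered} MB/s")
```

In PromQL, the equivalent of `exclude_devices` would be a negative regex matcher on the `device` label. Note that this only removes per-device duplication on a single host; it does nothing about traffic between mirrors, which would require knowing which peers are internal.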
Cheers!
A.
tor-project@lists.torproject.org