[tor-project] minutes from the sysadmin meeting

7 Oct 2019

      Hello!

Here are the minutes from this month's sysadmin meeting.

# Roll call: who's there and emergencies

anarcat, hiro, ln5, qbi and weasel are here.

# What has everyone been up to

## anarcat

 * announced LDAP sudo transition plan ([#6367][])
 * finished first phase of the hiera transition ([#30020][])
 * deployed trocla in test ([#30009][])
 * coordinate textile shutdown ([#31686][])
 * announced jabber service shutdown ([#31700][])
 * closed snowflake -> TPA transition ticket for now, external
   monitoring is sufficient ([#31232][])
 * improvements on grafana dashboards
 * gitlab, nextcloud transitions coordination and oversight
 * ooni.tpo to ooni.io transition coordination ([#31718][])
 * bugtracking on networking issues ([#31610][], [#31805][],
   [#31916][])
 * regular janitorial work (security upgrades, reboots, crashes, disk
   space management, etc)
 * started needrestart deployment to reduce that work ([#31957][])
 * completed the "reports card" questionaire ([#30881][])
 * continued work on the upstream prometheus module
 * tested puppetboard as a Puppet Dashboard ([#31969][])

[#6367]: https://bugs.torproject.org/6367
[#30009]: https://bugs.torproject.org/30009
[#30020]: https://bugs.torproject.org/30020
[#31232]: https://bugs.torproject.org/31232
[#31686]: https://bugs.torproject.org/31686
[#31700]: https://bugs.torproject.org/31700
[#31718]: https://bugs.torproject.org/31718
[#31610]: https://bugs.torproject.org/31610
[#31805]: https://bugs.torproject.org/31805
[#31916]: https://bugs.torproject.org/31916
[#31969]: https://bugs.torproject.org/31969

## weasel

 * Started with new onionoo hosts.  Currently there's just one backend
   on fsn, irl is doing the service part (cf. #31659)
 * puppet cleanup: nameserver/hoster info
 * new static master on fsn
 * staticsync and bacula puppet cleanups/major-rework/syncs with
   debian
 * new fsn web frontends.  only one is currently rotated
 * retire togashii, started retiring saxatile
 * moved windows VM away from textile
 * random updates/reboots/fixes
 * upgraded polyanthum to Debian 10

## Hiro

 * Setup dip so that it can be easily rebased with debian upstream
 * Migrated gettor from getulum to gettor-01
 * Random upgrades and reboots
 * Moving all my services to ansible or packages (no ad - hoc
   configuration):
     - Gettor can be deployed and updated via ansible
     - Survey should be deployed and updated via ansible
     - Gitlab (dip) is already on ansible
     - Schleuder should be maintained via packages
 * Nagios checks for gettor

## ln5

Didn't do much. :(

## qbi

Didn't do volunteering due to private stuff

# What we're up to next

## anarcat

New:

 * LDAP sudo transition ([#6367][])
 * jabber service shutdown ([#31700][])
 * considering unattended-upgrades or at least automated needrestart
   deployment ([#31957][])
 * followup on the various ops report card things ([#30881][])
 * maybe deploy puppetboard as a Puppet Dashboard ([#31969][]),
   possibly moving puppetdb to a separate machine
 * nbg1/prometheus stability issues, ipsec seems to be the problem
   ([#31916][])

Continuing/stalled:

 * director replacement ([#31786][])
 * taking a break on hiera refactoring ([#30020][])
 * send root@ emails to RT ([#31242][])
 * followup with email services improvements ([#30608][])
 * continue prometheus module merges
 * followup on SVN decomissionning ([#17202][])

[#17202]: https://bugs.torproject.org/17202
[#30608]: https://bugs.torproject.org/30608
[#30881]: https://bugs.torproject.org/30881
[#31242]: https://bugs.torproject.org/31242
[#31232]: https://bugs.torproject.org/31232
[#31786]: https://bugs.torproject.org/31786

## weasel

 * more VMs should move to gnt-fsn
 * more VMs should be upgraded
 * maybe get some of the pg config fu from dsa-puppet since the 3rd
   party pg module sucks

## Hiro
 * Nagios checks for bridgedb
 * decommissioning getulum
 * ansible recipe to manage survey.tp.o
 * dev portal coding in lektor
 * finishing moving gettor to gettor-01 includes gettor-web via lektor
 * do usual updates and rebots

## ln5

Nextcloud migration.

# Other discussions

## configuration management systems

We discussed the question of the "double tools problem" that seems to
be coming up with the configuration management system: most systems
are managed with Puppet, but some services are deployed with
Puppet. It was argued it might be preferable to use Puppet everywhere
to ease onboarding, since it would be one less tool to learn. But that
might require giving people root, or managing services ourselves,
which is currently out of the question. So it was agreed it's better
to have services managed with ansible than not managed at all...

# Next meeting

We're changing the time because 1400UTC would be too early for anarcat
because of daylight savings. We're pushing to 1500UTC, which is
1600CET and 1000EST.

# Metrics of the month

Access and transfer rates are an average over the last 15 days.

 * hosts in Puppet: 79, LDAP: 82, Prometheus exporters: 106
 * number of apache servers monitored: 26, hits per second: 177
 * number of self-hosted nameservers: 4, mail servers: 10
 * pending upgrades: 0, reboots: 0
 * average load: 0.51, memory available: 318.82 GiB/871.81 GiB,
   running processes: 379
 * bytes sent: 134.28 MB/s, received: 94.38 MB/s

Now also available as the main Grafana dashboard. Head to
<https://grafana.torproject.org/>, change the time period to 7 days,
and wait a while for results to render.

Note that the retention period of the Prometheus server has been
reduced from 30 to 15 days to address stability issues with the server
(ticket #31916), so far without luck.

-- 
Antoine Beaupré
torproject.org system administration