Hi,
Today, starting from at least 07:04:22 UTC, mail notifications sent from GitLab have been either delayed or dropped (unclear which, but probably the latter).
This means that if you rely on GitLab notifications to order your work, you will very likely need to log in to GitLab and look at your issues. A good way to catch up is to look at the latest notifications in your "To Do" list in:
https://gitlab.torproject.org/dashboard/todos
I am aware that many of you have a humongous and completely useless to-do list of death. I am sorry.
As of 20:50 UTC, email delivery has resumed and should be back to normal until further notice.
### Technical details
Busy people or people less interested in technical details can skip the remainder of this email.
It's unclear what happened. We're tracking the issue in this ticket:
https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/139
It looks like the regression was caused by the GitLab 15.9.3 to 15.10.0 upgrade: that upgrade completed at 06:35:53 UTC, half an hour before the first email was lost.
It's also unclear whether GitLab queued up those emails and is sending them now, but I suspect it just dropped them. I couldn't find the right log file among the thousands (literally) of log files GitLab keeps, so it's really hard to tell.
*Why* this regression happened is simply beyond me. kez and I poured the best of both of our brains into figuring out why, suddenly, GitLab decided to not only use STARTTLS to connect to the local SMTP server (which it was specifically told not to do) but *also* validate the certificate (which it was *also* told not to do). We currently use a bespoke CA for local SMTP servers, and, naturally, that certificate doesn't verify. And obviously setting the correct CA in GitLab's settings doesn't work either, because why would anything work at this point.
(Besides, it's unclear how anyone should issue a valid certificate for `localhost` in the first place... ANYWAY.)
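For reference, the knobs in question look roughly like this in `/etc/gitlab/gitlab.rb`; the values below are an illustrative sketch, not a dump of our production configuration:

```ruby
# /etc/gitlab/gitlab.rb -- illustrative values, not our production config
gitlab_rails['smtp_enable'] = true
gitlab_rails['smtp_address'] = "localhost"
gitlab_rails['smtp_port'] = 25
# both of these are supposed to keep GitLab away from STARTTLS...
gitlab_rails['smtp_tls'] = false
gitlab_rails['smtp_enable_starttls_auto'] = false
# ... and this one is supposed to disable certificate verification
gitlab_rails['smtp_openssl_verify_mode'] = 'none'
# pointing GitLab at the bespoke CA, which also didn't help (path is a placeholder)
gitlab_rails['smtp_ca_file'] = '/etc/ssl/certs/local-ca.pem'
```

(Changes there only take effect after a `gitlab-ctl reconfigure`, for those playing along at home.)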
I filed this issue upstream:
https://gitlab.com/gitlab-org/gitlab/-/issues/399241
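For the curious, the failure mode can be reproduced outside of GitLab entirely with plain Ruby, assuming the mailer stack behaves like stock `net/smtp` (a sketch under that assumption, not the actual GitLab code path):

```ruby
require 'net/smtp'
require 'openssl'

# Sketch: force STARTTLS against the local SMTP server, the way GitLab
# 15.10 now appears to, with net/smtp's default (strict) certificate
# verification left in place.
smtp = Net::SMTP.new('localhost', 25)
smtp.enable_starttls

begin
  smtp.start('localhost') { puts 'TLS handshake succeeded' }
rescue OpenSSL::SSL::SSLError => e
  # a certificate signed by a bespoke, untrusted CA fails verification here
  puts "handshake failed: #{e.message}"
end
```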
I am unsure how this issue is going to go, or how long the fix is going to last; it's all quite obscure.
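For what it's worth, the simplest end-to-end check for this seems to be sending a test mail from the Rails console (the address below is a placeholder):

```ruby
# run inside `sudo gitlab-rails console`; the address is a placeholder
Notify.test_email('someone@example.com', 'GitLab SMTP test', 'test body').deliver_now
```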
(This obscurity is why, by the way, we rarely try to patch GitLab. The code base is byzantine at best; they ship their own Rails, Ruby, PostgreSQL, Prometheus, Grafana (which is itself a special clusterfuck of deps), Chef (!), and I won't bore you with the rest of the list: it's a total mess, and it takes hours just to get your bearings to get anything done at all. In this specific case, we completely gave up on patching what should be a simple Rails app.)
So anyway. Fixed I guess?
A.