maintenance: Gitaly migration to improve GitLab performance (TPA-RFC-89)

Summary: migrate all Git storage to the new `gitaly-01` back-end in the coming week; each Git repository will be read-only during its migration.

**Table of Contents**

- Proposal
  - Affected projects
    - alpha phase, day one (2025-07-14)
    - beta phase, day two (2025-07-15)
    - production phase, day two or three (2025-07-15+)
    - Objections and exceptions
  - Impact
    - Projects read-only during migration
    - Additional complexity for TPA
  - Timeline
  - Hardware
- Background
- Alternatives considered
  - Full read-only backups
  - Partial or on-demand migration
- References and discussions

# Proposal

Move all Git repositories to the new [Gitaly][] server during Week 28, progressively, which means it will be impossible to push new commits to a repository while it is being migrated.

This should be a series of short, scoped outages, as each repository is marked read-only, one at a time, while it is migrated. The Gitaly migration procedure seems well tested and robust, as each repository is checksummed before and after migration.

We are hoping this will improve overall performance on the GitLab server. It is also part of the design upstream GitLab suggests for scaling an installation of our size.

## Affected projects

We plan on migrating the following namespaces, in order:

### alpha phase, day one (2025-07-14)

This is mostly dogfooding and automation:

1. `anarcat` (already done)
2. `tpo/tpa`
3. `tpo/web`

### beta phase, day two (2025-07-15)

This is to include testers outside of TPA, yet on projects that are less mission-critical and could survive *some* issues with their Git repositories.

4. `tpo/community`
5. `tpo/onion-services`
6. `tpo/anti-censorship`
7. `tpo/network-health`

### production phase, day two or three (2025-07-15+)

This is essentially all remaining projects:

8. `tpo/core` (includes c-tor and Arti!)
9. `tpo/applications` (includes Tor Browser and Mullvad Browser)
10. all remaining projects

### Objections and exceptions

If you do not want any such disruption in your project, please let us know before the deadline (2025-07-15) so we can skip your project. But we would rather migrate *all* projects off the server to simplify the architecture and better understand the impact of the change. We would like, in particular, to migrate all of the `tpo/applications` repositories in the coming week.

Conversely, if you want your project to be prioritized (it might mean a performance improvement!), let us know and you can jump the queue!

## Impact

### Projects read-only during migration

While a project is being migrated, it is "read-only": no change can be made to the Git repository. We believe that other features in projects (like issues and comments) should still work, but the [upstream documentation][] on this is not exactly clear:

> To ensure data integrity, projects are put in a temporary read-only state for the duration of the move. During this time, users receive a `The repository is temporarily read-only. Please try again later.` message if they try to push new commits.

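
For readers curious about what a single move looks like, here is a minimal sketch of how one could be scheduled through the API described in the upstream documentation quoted above. It is an illustration under stated assumptions, not necessarily how TPA ran the migration: `GITLAB_URL` is a placeholder, the storage name is assumed to match the `gitaly-01` server name, the token is assumed to belong to an administrator, and the `finished` state check reflects our reading of the upstream documentation. The request is only built, never sent.

```python
import json
import urllib.request

# Assumptions (not from the announcement): a placeholder instance URL and a
# Gitaly storage named "gitaly-01" in the GitLab configuration.
GITLAB_URL = "https://gitlab.example.org"
DESTINATION_STORAGE = "gitaly-01"


def build_move_request(project_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) the request that schedules a storage move."""
    url = f"{GITLAB_URL}/api/v4/projects/{project_id}/repository_storage_moves"
    body = json.dumps({"destination_storage_name": DESTINATION_STORAGE}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )


def is_done(move: dict) -> bool:
    """Return True once a move object returned by the API reports completion."""
    return move.get("state") == "finished"
```

Passing the resulting object to `urllib.request.urlopen()` would actually schedule the move; the project stays read-only until the move reaches its final state.
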
So far our test migrations have been so fast (a couple of seconds per project) that we have not really been able to test this properly.

### Additional complexity for TPA

TPA will need to get familiar with this new service. [Installation documentation][] is available and all the code developed to deploy the service is visible in an [internal merge request][].

I understand this is a big change right before going on vacation, so any TPA member can veto this and switch to the alternative, a partial or on-demand migration.

## Timeline

We plan on starting this work on July 15th, the coming Tuesday.

## Hardware

Like the current Git repositories on `gitlab-02`, the Git repositories on `gitaly-01` will be hosted on NVMe disks.

# Background

GitLab has been having performance problems for a long time now. And for almost as long, we've had the project to "scale GitLab to 2,000 users" ([tpo/tpa/team#40479][]). And while we believe bots (and now, in particular, Large Language Model (LLM) botnets) are responsible for a lot of that load, our [last performance incident][] concluded by observing that there seems to be a correlation between real usage and performance issues. Indeed, during the July break, GitLab's performance was stellar and, on Monday, as soon as Europe woke up from the break, GitLab's performance collapsed again.

And while it's *possible* that bots are driven by the same schedule as Tor people, we now feel it's simply time to scale the resources associated with one of our most important services.

Gitaly is GitLab's implementation of a Git server. It's basically an interface that translates gRPC requests into Git operations. It's currently running on the same server as the main GitLab app, but a new server has been built. More servers could be built as needed as well.

Anarcat performed [benchmarks][] showing equivalent or better performance on the new Gitaly server, even when influenced by the load of the current GitLab server. We expect the new server to reduce the load on the main GitLab server, but it's not clear by how much just yet.

We're hoping this new architecture will give us more flexibility to deploy more such back-ends in the future and to isolate performance issues to improve diagnostics. It's part of the normal roadmap in scaling a large GitLab installation such as ours.

# Alternatives considered

## Full read-only backups

We have considered performing a full backup of all the Git repositories before the migration. Unfortunately, this would require setting a read-only mode on all of GitLab for the duration of the backup which, according to our tests, could take anywhere from 20 to 60 minutes, which seemed like an unacceptable downtime.

Note that we do, of course, have nightly backups of the GitLab server, which is also backed by RAID-10 disk arrays on two different servers. We're only talking about a fully consistent Git backup here; our normal backups (which, rarely, can be inconsistent and require manual work to reconnect some refs) are typically sufficient anyway.

See [tpo/tpa/team#40518][] for a discussion on GitLab backups.

## Partial or on-demand migration

We have also considered a more piecemeal approach, migrating only some repositories.

We worry that this approach would lead to confusion about the real impact of the migration.

Still, if any TPA member feels strongly enough about this to veto this proposal, we can take this path and migrate only a few repositories instead.

We could, for example, migrate only the "alpha" targets and a few key repositories in the `tpo/applications` and `tpo/core` groups (since they're prime crawler targets), and leave the mass migration to a later time, with a longer test period.

# References and discussions

See the [discussion issue][] for comments and more background.

[discussion issue]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42225
[Gitaly]: https://gitlab.com/gitlab-org/gitaly
[upstream documentation]: https://docs.gitlab.com/api/project_repository_storage_moves/
[last performance incident]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42152
[Installation documentation]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/gitlab#gitaly
[internal merge request]: https://gitlab.torproject.org/tpo/tpa/puppet-control/-/merge_requests/89
[tpo/tpa/team#40479]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40479
[benchmarks]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42225#note_3223245
[tpo/tpa/team#40518]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40518

-- 
Antoine Beaupré
torproject.org system administration

On 14/07/2025 17:19, Antoine Beaupré via tor-project wrote:
> Move all Git repositories to the new [Gitaly][] server during Week 28, progressively, which means it will be impossible to push new commits to a repository while it is migrated.
>
> This should be a series of short, scoped outages, as each repository is marked as read-only one at a time when it's migrated. The Gitaly migration procedure seems well tested and robust, as each repository is checksummed before and after migration.

Sounds cool, I hope it goes smoothly!

Do you know roughly what time of day (UTC) this will happen? In particular for tpo/applications/tor-browser, and similarly for its forks.

Also, do you have an idea of the downtime duration, and whether it scales with the repository size? I guess you might know more after the alpha phase.

-henry

Participants (2): Antoine Beaupré, Henry Wilkes