[tor-bugs] #29697 [Internal Services]: archive.tpo is soon running out of space

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue May 21 14:08:42 UTC 2019


#29697: archive.tpo is soon running out of space
-------------------------------+------------------------
 Reporter:  boklm              |          Owner:  (none)
     Type:  defect             |         Status:  new
 Priority:  Medium             |      Milestone:
Component:  Internal Services  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+------------------------

Comment (by anarcat):

 TL;DR: possible paths:

  1. Internet Archive (IA)
  2. Software Heritage
  3. commercial storage (e.g. Amazon Glacier)
  4. host our own
  5. spend more time deciding on archival policies
  6. mix of the above

 One way to manage stuff like this is to break it up into smaller pieces
 and distribute it around. A typical way I manage such archives is with
 git-annex, which allows for reliable tracking of N copies (say "3
 redundant copies") and supports *many* different "remotes", including
 Amazon Glacier, Internet Archive (IA) and so on. It's what I used in the
 Brazil archival project and it mostly worked. Unfortunately, it's hard to
 use, which may be a big blocker for adoption.

 If git-annex is too complicated, we can talk to IA directly. I would
 recommend, however, against using their web-based upload interface which,
 as even they acknowledge, is terrible and barely usable. I packaged the
 [https://tracker.debian.org/pkg/python-internetarchive internetarchive]
 python client in Debian to work around that problem and it works much
 better.

 Moving files to IA only shifts the problem, in my opinion: we then have
 only a single copy, elsewhere, and while we no longer need to manage that
 space, we also don't manage backups and will never know if they drop
 stuff on us (and they do, sometimes, either deliberately or by mistake). I
 would propose that if stuff moves out of our "backed-up" infrastructure,
 it should be stored in at least two administratively distinct locations.

 Another such location we could use, apart from commercial providers like
 Amazon, is the [http://softwareheritage.org/ Software Heritage] project
 ([https://en.wikipedia.org/wiki/Software_Heritage WP]) which is *designed*
 to store copies of source code and software artifacts of all breeds. It
 might even already have something for Tor.

 Otherwise, assuming we can solve this problem ourselves, I think this
 question boils down to "How big of an archive do we actually need and how
 fast does it grow?" With the limited Grafana history I had available a
 week ago, I calculated that we dump roughly 10GB per week of new stuff on
 there, but naturally the sample size is too small to take that number
 seriously. To give you another metric, in the last two weeks (as of now,
 one week later), free space has gone from 254GB to 207GB: we ate a
 whopping 47GB in 15 days, which clocks the rate at ~3GB a day or ~22GB a
 week. When I looked at it a week ago, we had 220GB left, which gives a
 rate of 13GB/week over the last week, so I would estimate the burn rate
 is between 10 and 20GB/week, which gives us about 10 to 20 weeks to act
 on this problem.
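
 The napkin math above can be reproduced with a short Python sketch. The
 function names are hypothetical and the inputs are only the free-space
 figures quoted in this comment; a constant burn rate is assumed, which is
 almost certainly too simple.

```python
def burn_rate_gb_per_week(free_before_gb, free_after_gb, days):
    """Average consumption in GB/week between two free-space readings."""
    return (free_before_gb - free_after_gb) * 7 / days


def weeks_until_full(free_gb, rate_gb_per_week):
    """Weeks left before free space hits zero, assuming a constant rate."""
    return free_gb / rate_gb_per_week


# Figures quoted in this comment (free space on archive.tpo).
fifteen_day_rate = burn_rate_gb_per_week(254, 207, 15)  # about 21.9 GB/week
one_week_rate = burn_rate_gb_per_week(220, 207, 7)      # 13.0 GB/week

print(f"15-day sample: {fifteen_day_rate:.1f} GB/week")
print(f"7-day sample:  {one_week_rate:.1f} GB/week")

# Bracket the estimate between 10 and 20 GB/week, as in the text.
for rate in (10, 20):
    print(f"at {rate} GB/week: {weeks_until_full(207, rate):.1f} weeks left")
```

 With 207GB free, the 10 and 20GB/week bounds give roughly 21 and 10 weeks
 respectively, which is where the "10 to 20 weeks" window comes from.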

 Assuming 10GB/week, this means we need ~500GB of *new* storage every year.
 With our current setup, this translates into roughly 2x1TB of storage per
 year because of RAID and backups.
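
 As a rough check of that figure, here is a minimal Python sketch. The
 multipliers are my assumptions, not measured values: RAID is taken to be
 simple mirroring (2x), and backups are taken to be one extra copy that is
 itself mirrored (another 2x). The names are hypothetical.

```python
WEEKS_PER_YEAR = 52

# Assumptions for illustration only: mirrored RAID doubles raw disk usage,
# and one (also mirrored) backup copy doubles it again.
RAID_FACTOR = 2
BACKUP_FACTOR = 2


def yearly_growth_gb(rate_gb_per_week):
    """New data per year, in GB, at a constant weekly rate."""
    return rate_gb_per_week * WEEKS_PER_YEAR


def raw_disk_gb(new_data_gb, raid_factor=RAID_FACTOR, backup_factor=BACKUP_FACTOR):
    """Raw disk needed once redundancy and backups are accounted for."""
    return new_data_gb * raid_factor * backup_factor


new_data = yearly_growth_gb(10)
print(f"new data per year: {new_data} GB")
print(f"raw disk per year: {raw_disk_gb(new_data) / 1000:.2f} TB")
```

 Under those assumptions, 10GB/week is 520GB of new data and about 2TB of
 raw disk per year, consistent with the "2x1TB" estimate.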

 So if we want this problem to go away for ~10 years (assuming the current
 rate, which is probably inaccurate, at best), we could throw hardware at
 the problem and give Hetzner another ~200EUR/month specifically for an
 archival server. We might be able to save some costs by *not* backing up
 the server and using IA/Software Heritage as a fallback, with git-annex as
 well.
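
 To put the hardware option in perspective, a small Python sketch of the
 ten-year totals follows. It assumes the monthly price and the 10GB/week
 growth rate both stay flat for the whole period, which is unlikely; the
 function names are hypothetical.

```python
def total_cost_eur(monthly_eur, years):
    """Total spend over the period at a flat monthly price."""
    return monthly_eur * 12 * years


def total_new_data_tb(rate_gb_per_week, years, weeks_per_year=52):
    """New data accumulated over the period, in TB, at a constant rate."""
    return rate_gb_per_week * weeks_per_year * years / 1000


print(f"10-year cost:     {total_cost_eur(200, 10)} EUR")
print(f"10-year new data: {total_new_data_tb(10, 10):.1f} TB")
```

 That is on the order of 24,000EUR for about 5TB of new data, before any
 redundancy.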

 Fundamentally, this is a cost problem. Do you want us to spend time to
 figure out a proper archival policy and cheap/free storage locations or
 pay for an archival server?

 In any case, I'd be happy to dig deeper into this to figure out the
 various options beyond the above napkin calculations.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29697#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

