[tor-bugs] #31690 [Internal Services/Service - trac]: study trac.torproject.org archival possibilities

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed May 13 14:44:42 UTC 2020


#31690: study trac.torproject.org archival possibilities
----------------------------------------------+---------------------
 Reporter:  anarcat                           |          Owner:  qbi
     Type:  project                           |         Status:  new
 Priority:  Medium                            |      Milestone:
Component:  Internal Services/Service - trac  |        Version:
 Severity:  Normal                            |     Resolution:
 Keywords:  tickets-migration                 |  Actual Points:
Parent ID:  #30857                            |         Points:
 Reviewer:                                    |        Sponsor:
----------------------------------------------+---------------------

Old description:

> this is a split out of #30857 to discuss specifically the question of
> if/how to archive trac.torproject.org.
>
> As mentioned in that ticket, there are a few options on how to deal with
> trac, provided we have another system we want to use:
>
>  1. '''the golden redirect set''': every migrated ticket and wiki page
> has a corresponding ticket/wiki page in GitLab and a gigantic set of
> redirection rules makes sure they are mapped correctly. probably
> impractical, but solves the maintenance problem possibly forever.
>
>  2. '''read-only Trac''': user creation is disabled and existing users
> are locked from making any change to the site. only a temporary or
> intermediate measure.
>
>  3. '''fossilization''': Trac is turned into a static HTML site that can
> be mirrored like any other site. can be a long term solution and a good
> compromise with a possibly impossible to design and therefore failing
> (because incomplete) set of redirection rules.
>
>  4. '''destruction''': we hate the web and pretend link rot is not a
> problem and just get rid of the old site, assuming everything is migrated
> and people will find their stuff eventually. probably not an option.
>
> == Archive team work
>
> With my archive team hat, I was able to coordinate a first archival of
> the website during the summer of 2019, as documented in #30857. This is
> an attempt at doing "3. '''fossilization'''".
>
> All those jobs end up populating the wayback machine at web.archive.org,
> but are also available as WARC files, an archival format for web pages.
>
> A first archival of all tickets up to #30856 has been performed here:
>
> https://archive.fart.website/archivebot/viewer/job/5vytc
>
> It's about 600MB of compressed HTML (more or less).
>
> Then a full archival job of the entire site was performed here:
>
> https://archive.fart.website/archivebot/viewer/job/bpu6j
>
> It created about 10GB of WARC files, crawled over 730,000 links
> (including external sites linked from Trac) and 105.34GiB of data. It
> took over 5 days:
>
> {{{
> 2019-06-17 01:49:02,514 - wpull.application.tasks.stats - INFO -
> Duration: 5 days, 7:32:55. Speed: 0.0 B/s.
> 2019-06-17 01:49:02,514 - wpull.application.tasks.stats - INFO -
> Downloaded: 732488 files, 105.4 GiB.
> }}}
>
> == Other statistics
>
> Archiving the server itself means dealing with:
>
>  * ~1GB of attachments
>  * 4GB PostgreSQL database
>
> The actual server uses around 25GB of disk space because of random junk
> here and there but that's the very minimum it can be trimmed down to.
> naturally, we can keep *that* data forever, the problem is keeping the
> app running on top of that... That would be some incarnation of "4.
> '''destruction'''".

New description:

 this is a split out of #30857 to discuss specifically the question of
 if/how to archive trac.torproject.org.

 As mentioned in that ticket, there are a few options on how to deal with
 trac, provided we have another system we want to use:

  1. '''the golden redirect set''': every migrated ticket and wiki page has
 a corresponding ticket/wiki page in GitLab and a gigantic set of
 redirection rules makes sure they are mapped correctly. probably
 impractical, but solves the maintenance problem possibly forever.

  2. '''read-only Trac''': user creation is disabled and existing users are
 locked from making any change to the site. only a temporary or
 intermediate measure.

  3. '''fossilization''': Trac is turned into a static HTML site that can
 be mirrored like any other site. can be a long term solution and a good
 compromise with a possibly impossible to design and therefore failing
 (because incomplete) set of redirection rules.

  4. '''destruction''': we hate the web and pretend link rot is not a
 problem and just get rid of the old site, assuming everything is migrated
 and people will find their stuff eventually. probably not an option.

  5. '''redirect to the wayback machine''': like '''fossilization''', but
 delegate to the internet archive and hope for the best.
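
 As a rough illustration of option 5, a redirect target on the wayback
 machine can be derived from the original Trac URL alone. The sketch below
 is in Python; the `wayback_url` helper and the default `20190617`
 timestamp (the day the full crawl finished) are assumptions made for the
 example, not something that is deployed. It relies on the
 `https://web.archive.org/web/<timestamp>/<url>` address scheme, where the
 wayback machine is expected to serve the capture closest to the given
 timestamp.

```python
from urllib.parse import urlsplit

WAYBACK_PREFIX = "https://web.archive.org/web"
TRAC_HOST = "trac.torproject.org"


def wayback_url(original_url: str, timestamp: str = "20190617") -> str:
    """Build a wayback machine URL for a trac.torproject.org page.

    `timestamp` is a hypothetical default (YYYYMMDD); the archive is
    expected to pick the capture nearest to it.
    """
    parts = urlsplit(original_url)
    # Refuse anything that is not a Trac URL, so the helper cannot be
    # turned into a generic open redirector.
    if parts.hostname != TRAC_HOST:
        raise ValueError(f"not a Trac URL: {original_url}")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("https://trac.torproject.org/projects/tor/ticket/31690"))
```

 A web server rewrite rule could do the same prefixing without any code;
 the function form only makes the mapping explicit.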

 == Archive team work

 With my archive team hat, I was able to coordinate a first archival of the
 website during the summer of 2019, as documented in #30857. This is an
 attempt at doing "3. '''fossilization'''".

 All those jobs end up populating the wayback machine at web.archive.org,
 but are also available as WARC files, an archival format for web pages.

 A first archival of all tickets up to #30856 has been performed here:

 https://archive.fart.website/archivebot/viewer/job/5vytc

 It's about 600MB of compressed HTML (more or less).

 Then a full archival job of the entire site was performed here:

 https://archive.fart.website/archivebot/viewer/job/bpu6j

 It created about 10GB of WARC files after crawling over 730,000 links
 (including external sites linked from Trac) and downloading 105.34GiB of
 data. It took over 5 days:

 {{{
 2019-06-17 01:49:02,514 - wpull.application.tasks.stats - INFO - Duration: 5 days, 7:32:55. Speed: 0.0 B/s.
 2019-06-17 01:49:02,514 - wpull.application.tasks.stats - INFO - Downloaded: 732488 files, 105.4 GiB.
 }}}
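
 To put those numbers in perspective, the average crawl rate can be worked
 out from the two log lines above. This is only a back-of-the-envelope
 sketch in Python: the `parse_stats` helper and its regular expressions are
 assumptions written for this one log excerpt, not part of wpull. It comes
 out to roughly 1.6 files per second and a bit under a quarter of a MiB per
 second, averaged over the whole job.

```python
import re
from datetime import timedelta

# The relevant parts of the wpull log lines quoted above.
LOG = """\
Duration: 5 days, 7:32:55. Speed: 0.0 B/s.
Downloaded: 732488 files, 105.4 GiB.
"""


def parse_stats(log: str):
    """Extract (duration, file count, GiB downloaded) from the excerpt."""
    d = re.search(r"Duration: (\d+) days?, (\d+):(\d+):(\d+)", log)
    f = re.search(r"Downloaded: (\d+) files, ([\d.]+) GiB", log)
    if d is None or f is None:
        raise ValueError("log excerpt does not have the expected format")
    days, hours, minutes, secs = map(int, d.groups())
    duration = timedelta(days=days, hours=hours, minutes=minutes, seconds=secs)
    return duration, int(f.group(1)), float(f.group(2))


duration, files, gib = parse_stats(LOG)
seconds = duration.total_seconds()
files_per_second = files / seconds
mib_per_second = gib * 1024 / seconds  # 1 GiB = 1024 MiB

print(f"{files_per_second:.2f} files/s, {mib_per_second:.3f} MiB/s")
```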

 == Other statistics

 Archiving the server itself means dealing with:

  * ~1GB of attachments
  * 4GB PostgreSQL database

 The actual server uses around 25GB of disk space because of random junk
 here and there but that's the very minimum it can be trimmed down to.
 naturally, we can keep *that* data forever, the problem is keeping the app
 running on top of that... That would be some incarnation of "4.
 '''destruction'''".

--

Comment (by anarcat):

 a conversation is happening over tor-internal about which option to choose
 between fossilization (3), destruction (4) and "just use the internet
 archive" (added as option 5 in the above list). so far (3) has two votes,
 but I've warned about the complexity of the problem and personally favor
 option 5 at this stage.

 the golden redirect set (option 1) has been ruled out for now because the
 gitlab migration has stalled and we are unsure that we can actually
 reliably migrate those tickets.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31690#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

