[tor-bugs] #25100 [Metrics/CollecTor]: Make CollecTor's webstats module use less RAM and CPU time

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Feb 1 09:21:23 UTC 2018


#25100: Make CollecTor's webstats module use less RAM and CPU time
-------------------------------+--------------------------------
 Reporter:  karsten            |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_revision
 Priority:  High               |      Milestone:
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+--------------------------------

Comment (by iwakeh):

 Replying to [comment:8 karsten]:
 > Replying to [comment:7 iwakeh]:
 > > True, so far we didn't trade memory for time, but got some
 > > improvements that could be picked easily even winning some time here.
 > > Keeping counts of different sanitized lines in memory could also help
 > > and might be only a small change; I'm looking into this next.
 >
 > Aha! That sounds very promising, too. Maybe even leave out the date part
 > from sanitized lines and keep a bag of dates containing sanitized lines.
 > Something like `Map<String, Bag<LocalDate>>` (yes, I know that there's no
 > `Bag` type in Java; time to add Apache Commons Collections?). And later
 > when we write sanitized logs, we simply put in the date.
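
 A minimal sketch of that idea using only the standard library, with a
 nested map of counts standing in for the missing `Bag` type.  The class
 name `SanitizedLineCounts`, the `{date}` placeholder, and the ISO date
 format are assumptions made for illustration, not CollecTor code:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts date-less sanitized lines per date. */
class SanitizedLineCounts {

  /** Assumed marker for the position of the date inside a stored line. */
  static final String DATE_PLACEHOLDER = "{date}";

  /** Date-less sanitized line -> (date -> number of occurrences). */
  private final Map<String, Map<LocalDate, Integer>> counts = new TreeMap<>();

  /** Records one occurrence of a sanitized line on the given date. */
  void add(String lineWithoutDate, LocalDate date) {
    counts.computeIfAbsent(lineWithoutDate, k -> new TreeMap<>())
        .merge(date, 1, Integer::sum);
  }

  /** Puts the date back in and repeats each line as often as it was counted. */
  List<String> linesFor(LocalDate date) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, Map<LocalDate, Integer>> e : counts.entrySet()) {
      int n = e.getValue().getOrDefault(date, 0);
      String line = e.getKey().replace(DATE_PLACEHOLDER, date.toString());
      for (int i = 0; i < n; i++) {
        out.add(line);
      }
    }
    return out;
  }
}
```

 Each distinct line is stored once, however often it occurs, and the
 sorted maps return lines in a stable order when writing.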

 Depending on the target scenarios, it might also be very fruitful, and a
 reusable approach for other CollecTor modules, not to implement
 'compression' (which the above is) by hand, but rather to use some
 in-memory database that compresses the highly redundant data at hand.
 Reasoning: the above-mentioned 8867 logs from weschniakowsky and meronense
 combined are just 60M when xz compressed and roughly 20G (plus/minus x)
 uncompressed.  Even if the in-memory db achieves a compression about ten
 times less efficient than xz, only about 600M would be needed.  Plus we'd
 get some SQL(-like) query support in addition.
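
 Spelled out: 20G / 60M is a ratio of roughly 333:1 for xz, so a ratio ten
 times worse (about 33:1) still brings 20G down to about 600M.  As a rough
 way to check how redundant such logs are, the sketch below compresses
 made-up log lines with the JDK's DEFLATE implementation.  DEFLATE is only
 a stand-in here, not an in-memory database, and the class name
 `RedundancyCheck` and the sample lines are assumptions:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

/** Hypothetical experiment: how well do redundant log lines compress? */
class RedundancyCheck {

  /** Compresses the given bytes with DEFLATE and returns the compressed size. */
  static int deflatedSize(byte[] raw) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
      out.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The stream is closed (and thereby finished) before the size is read.
    return buffer.size();
  }

  /** Builds made-up log lines that differ only in the response size. */
  static byte[] sampleLog(int lines) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < lines; i++) {
      sb.append("0.0.0.0 - - [01/Feb/2018:00:00:00 +0000] ")
          .append("\"GET / HTTP/1.1\" 200 ")
          .append(1000 + i % 10)
          .append('\n');
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }
}
```

 Comparing `sampleLog(n).length` with `deflatedSize(sampleLog(n))` gives the
 ratio for the synthetic data; real sanitized logs would have to be
 measured separately.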

 If it works, we'd have a useful approach to reuse widely in metrics'
 code base.

 Thoughts?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25100#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

