[tor-bugs] #25161 [Metrics/CollecTor]: Fix any memory problem caused by number of log files to be imported by the webstats module (was: Fix another memory problem with the webstats bulk import)

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Feb 21 09:12:15 UTC 2018


#25161: Fix any memory problem caused by number of log files to be imported by the
webstats module
-------------------------------+-----------------------------------
 Reporter:  karsten            |          Owner:  iwakeh
     Type:  defect             |         Status:  needs_information
 Priority:  Medium             |      Milestone:
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+-----------------------------------

New description:

 This ticket should enable the webstats module to import any number
 (within reasonable bounds) of log files that are below the 2G limit (cf.
 comment:12).
 The handling of larger logs is treated in #25317.
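 The "2G limit" comes from the JVM itself: Java arrays are indexed by
 `int`, so a single `byte[]` cannot exceed `Integer.MAX_VALUE` elements.
 A short illustration (a sketch for context, not CollecTor code):

 ```java
 public class ArrayLimit {
   public static void main(String[] args) {
     // Java arrays are indexed by int, so a single byte[] is capped at
     // Integer.MAX_VALUE = 2^31 - 1 elements, i.e. just under 2 GiB.
     // Calling String.getBytes() on a log larger than that cannot
     // allocate the result array, and the JVM throws
     // "java.lang.OutOfMemoryError: Requested array size exceeds VM limit".
     System.out.println(Integer.MAX_VALUE); // prints 2147483647
   }
 }
 ```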

 ------
 Initial report:
 I'm running a modified CollecTor that sanitizes webstats with some tweaks
 towards bulk importing existing webstats. In particular, it reads files in
 slices of 10 MiB plus another 2 MiB that overlap with the next slice. I
 just pushed the changes [https://gitweb.torproject.org/karsten/metrics-
 db.git/commit/?h=webstats-
 slices&id=6c1a266d726ceedf4518e7521a6ed58a6764aca4 here].

 First of all, the runtime is okay. Not great, but okay. It takes 36
 minutes to sanitize 10 MiB. We have 927 MiB of files, so 93 slices, which
 is going to take ~2.5 days.
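 The runtime estimate above can be checked with a quick back-of-the-envelope
 calculation (using the slice size and per-slice time stated in the report):

 ```java
 public class SliceEstimate {
   public static void main(String[] args) {
     double totalMiB = 927.0;        // total size of web log files
     double sliceMiB = 10.0;         // slice size (the 2 MiB overlap is ignored here)
     double minutesPerSlice = 36.0;  // observed time to sanitize one slice
     int slices = (int) Math.ceil(totalMiB / sliceMiB);    // 93 slices
     double days = slices * minutesPerSlice / 60.0 / 24.0; // ~2.3 days
     System.out.println(slices + " slices, ~" + Math.round(days * 10) / 10.0 + " days");
   }
 }
 ```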

 However, I ran into an out-of-memory problem at the 6th slice:

 {{{
 2018-02-06 13:30:36,499 INFO o.t.c.w.SanitizeWeblogs:116 Processing 20 logs for dist.torproject.org on archeotrichon.torproject.org.
 2018-02-06 13:40:28,968 ERROR o.t.c.c.CollecTorMain:71 The webstats module failed: null
 java.lang.OutOfMemoryError: null
         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
         at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
         at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
         at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
         at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
         at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
         at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
         at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
         at org.torproject.collector.webstats.SanitizeWeblogs.findCleanWrite(SanitizeWeblogs.java:127)
         at org.torproject.collector.webstats.SanitizeWeblogs.startProcessing(SanitizeWeblogs.java:91)
         at org.torproject.collector.cron.CollecTorMain.run(CollecTorMain.java:67)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
         at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
         at java.lang.StringCoding.encode(StringCoding.java:344)
         at java.lang.StringCoding.encode(StringCoding.java:387)
         at java.lang.String.getBytes(String.java:958)
         at org.torproject.descriptor.log.LogDescriptorImpl.collectionToBytes(LogDescriptorImpl.java:119)
         at org.torproject.descriptor.log.WebServerAccessLogImpl.<init>(WebServerAccessLogImpl.java:72)
         at org.torproject.collector.webstats.SanitizeWeblogs.storeSanitized(SanitizeWeblogs.java:147)
         at org.torproject.collector.webstats.SanitizeWeblogs.lambda$findCleanWrite$3(SanitizeWeblogs.java:127)
         at org.torproject.collector.webstats.SanitizeWeblogs$$Lambda$38/1233367077.accept(Unknown Source)
         at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
         at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
         at java.util.concurrent.ConcurrentHashMap$EntrySpliterator.forEachRemaining(ConcurrentHashMap.java:3606)
         at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
         at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
         at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
         at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
         at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
         at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
         at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
 }}}

 I didn't look very closely, but I believe we're running out of memory
 while writing a sanitized file to disk, in particular while converting a
 list of strings to a byte array that we're then compressing and writing to
 disk. If this is the case, can we avoid creating that second "copy" of the
 file in memory and write lines to the file directly?
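 If the `byte[]` built in `collectionToBytes` is indeed the culprit, one way
 to avoid the second in-memory copy is to stream lines straight into the
 compressed output. A minimal sketch using the JDK's `GZIPOutputStream`
 (CollecTor's actual output format and file layout may differ;
 `writeCompressed` is a hypothetical helper, not CollecTor API):

 ```java
 import java.io.BufferedWriter;
 import java.io.IOException;
 import java.io.OutputStreamWriter;
 import java.nio.charset.StandardCharsets;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.Paths;
 import java.util.List;
 import java.util.zip.GZIPOutputStream;

 public class StreamingLogWriter {

   /** Writes lines directly into a compressed stream, so the whole
    *  sanitized file never exists as one String or byte[] in memory. */
   static void writeCompressed(Path out, Iterable<String> lines)
       throws IOException {
     try (BufferedWriter w = new BufferedWriter(new OutputStreamWriter(
         new GZIPOutputStream(Files.newOutputStream(out)),
         StandardCharsets.UTF_8))) {
       for (String line : lines) {
         w.write(line);
         w.newLine();
       }
     }
   }

   public static void main(String[] args) throws IOException {
     writeCompressed(Paths.get("sanitized.log.gz"),
         List.of("sanitized line 1", "sanitized line 2"));
   }
 }
 ```

 Memory use then stays bounded by the writer's buffer rather than by the
 size of the sanitized log.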

 Or is this just the operation where we happen to run out of memory from
 accumulating stuff over time, and where fixing this issue would just mean
 that we're failing somewhere else, shortly after?

 Should I rather have each run sanitize a single slice and then exit,
 store which slice has been processed, and run CollecTor in an endless
 loop? Or something like that, but something that collects all the garbage
 between slices?

 (Note that I still need to check the output and whether that looks okay
 across slices. Doing that now, unrelated to the issue at hand.)

--

Comment (by iwakeh):

 Changing the summary and title to reflect what is to be done here.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25161#comment:14>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

