[metrics-team] duplicates in collector tarballs?

Karsten Loesing karsten at torproject.org
Thu Jun 16 11:51:07 UTC 2016


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 15/06/16 23:43, tl wrote:
>> 
>> On 15.06.2016, at 15:07, Karsten Loesing <karsten at torproject.org>
>> wrote:
>> 
>> Hi Thomas,
>> 
>> my first guess is that you're looking at a different timestamp
>> than CollecTor for deciding which tarball a descriptor belongs
>> in.
> 
> I’m using getPublishedMillis() in most cases, except:
> 
> - Consensus: getValidAfterMillis()
> - Torperf: getStartMillis()
> - Tordnsel: getDownloadedMillis()
> 
> What is CollecTor using?

Off the top of my head, those three are correct.  Plus, votes would be
sorted by getValidAfterMillis(), too.
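
To make the sorting rule concrete, here is a minimal sketch of how a
timestamp in milliseconds could be mapped to a "yyyy-MM" month key in UTC.
The class and method names (MonthBucket, monthKey) are made up for
illustration and are not part of metrics-lib or CollecTor.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of metrics-lib or CollecTor.
final class MonthBucket {

  // Format timestamps in UTC so that month boundaries do not depend on
  // the time zone of the machine doing the conversion.
  private static final DateTimeFormatter YEAR_MONTH =
      DateTimeFormatter.ofPattern("uuuu-MM").withZone(ZoneOffset.UTC);

  // Returns the "yyyy-MM" key for a timestamp given in milliseconds since
  // the epoch, e.g. the value of getPublishedMillis().
  static String monthKey(long millis) {
    return YEAR_MONTH.format(Instant.ofEpochMilli(millis));
  }
}
```

With this, a descriptor published at 2007-10-01 00:30:00 UTC gets the key
"2007-10", and one published a second before midnight on 2007-09-30 gets
"2007-09".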

>> Unfortunately, "relays 2007-08 and 2007-09" is rather vague,
>> because relays published all kinds of descriptors in those two
>> months, and I can't really look at all those tarballs right now.
> 
> Sorry, my bad. That’s the short name I used internally for 
> server-descriptors-2007-09.tar.xz 
> server-descriptors-2007-08.tar.xz
> 
> 
>> Can you list a tarball and a file contained in that tarball which
>> you think doesn't belong there?
> 
> Converting server-descriptors-2007-09.tar.xz I get 3 results:
> Relay_2007-08.json, Relay_2007-09.json and Relay_2007-10.json. I’m
> attaching the latter:
> 
> Both descriptors are from October 1, early in the morning.

Yep, you're right, that looks bad.  I wrote a small Java application
to parse through all tarballs and tell me which of them contain
descriptors that don't belong there.  I'm attaching the sources, FYI.

I'll create a ticket as soon as I have a better sense of what's going
wrong.  But server-descriptors-2007-09.tar.xz looks indeed problematic.

> And I’m also wondering whether I shouldn’t just use the date of the
> tarball that contains the descriptors. I hadn’t expected any
> problems here, so I went for the (easily reachable) dates in the
> descriptors, but it seems safest to reproduce the CollecTor
> tarballs as faithfully as possible, no matter how the descriptors
> were allocated, especially since the situation gets even more complex
> with Consensus, Torperf and Tordnsel. I just don’t know how exactly
> I could get hold of the name of the tarball that a descriptor is
> extracted from. It seems that metrics-lib.DescriptorReader doesn’t
> provide the name of the tarball it’s reading. Can you do something
> about that?

You should be able to learn that via DescriptorFile.  See the Javadocs
there.
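
Once the tarball's file name is known, the month can be read from the name
itself. The following is only a sketch under the assumption that tarball
names end in "-yyyy-MM.tar.<suffix>", like
server-descriptors-2007-09.tar.xz; the class and method names
(TarballMonth, fromFileName) are made up for illustration. How to obtain
the name from DescriptorFile is left to the Javadocs.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of metrics-lib.
final class TarballMonth {

  // Matches the "-yyyy-MM" part directly in front of ".tar", as in
  // "server-descriptors-2007-09.tar.xz".
  private static final Pattern YEAR_MONTH =
      Pattern.compile("-(\\d{4}-\\d{2})\\.tar");

  // Returns the "yyyy-MM" part of a tarball file name, or an empty
  // Optional if the name does not follow the assumed pattern.
  static Optional<String> fromFileName(String fileName) {
    Matcher matcher = YEAR_MONTH.matcher(fileName);
    return matcher.find()
        ? Optional.of(matcher.group(1))
        : Optional.empty();
  }
}
```

For server-descriptors-2007-09.tar.xz this yields "2007-09"; a name
without a month yields an empty Optional rather than a guess.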

> Ciao Thomas

All the best,
Karsten



>> All the best, Karsten
>> 
>> 
>> On 14/06/16 11:33, tl wrote:
>>> 
>>>> On 14.06.2016, at 11:27, tl <tl at rat.io> wrote:
>>>> 
>>>>> 
>>>>> On 14.06.2016, at 10:05, Karsten Loesing 
>>>>> <karsten at torproject.org> wrote:
>>>>> 
>>>>> Hi Thomas,
>>>>> 
>>>>> can you give one or more examples?
>>>> 
>>>> Unfortunately I didn’t keep note of them. When I couldn’t
>>>> convert all descriptors of one type in one run (because I ran
>>>> into memory limits) I converted descriptors per year. Maybe
>>>> in 20% of these cases I got results like this:
>>>> 
>>>> -rwxrwxrwx 1 t t    1978191 Jun 14 03:23 RelayVote_2015-12.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2473316989 Jun 14 03:23 RelayVote_2016-01.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2384211448 Jun 14 03:23 RelayVote_2016-02.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2265386311 Jun 14 03:23 RelayVote_2016-03.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2339112076 Jun 14 03:23 RelayVote_2016-04.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2062026086 Jun 14 03:23 RelayVote_2016-05.parquet.snappy
>>>> 
>>>> where I had only converted tarballs of 2016.
>>>> 
>>>> 
>>>> I had similar issues when I converted tarballs from another
>>>> year, but I don’t remember for sure which type and which year.
>>>> I think (!) it was relays for 2012-08 and 2012-09, so it’s not
>>>> only an issue at year ends. It seems like my JSON
>>>> converter handles this issue differently than my Parquet
>>>> converter. The JSON converter didn’t run into memory issues
>>>> and seems to be happy to append to data already written to
>>>> disk. The Parquet converter, otoh, often (but not always :-/)
>>>> keeps everything in memory and only in the very last step
>>>> writes everything to disk in one flush. Then sometimes the
>>>> results for one or two months remain completely empty, and my
>>>> current guess would be that in those cases there was an
>>>> overlap of descriptors in tarballs from different months and
>>>> the converter couldn’t decide which one to write out. The
>>>> two months mentioned above were such a case, and when I then
>>>> converted them separately I also got results for the months
>>>> 2012-07 and 2012-10. But again: I’m neither sure about the
>>>> year nor the type of descriptor. I would have to rerun
>>>> conversions and search for them. Should I?
>>> 
>>> Ha, found them in the bash-history: relays 2007-08 and 2007-09
>>> 
>>> c’t
>>> 
>>> 
>>>> Ciao Thomas
>>>> 
>>>> 
>>>>> All the best, Karsten
>>>>> 
>>>>> 
>>>>> On 13/06/16 22:19, tl wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> when testing some descriptor converter I stumbled across
>>>>>> the fact that descriptor tarballs for a given month
>>>>>> sometimes contain a few descriptors from the month before
>>>>>> or after. That introduces a problem that I might be able
>>>>>> to overcome by poking at the code but before I try that
>>>>>> I’d like to know:
>>>>>> 
>>>>>> - If a descriptor tarball for, say, 2012-10 also contains
>>>>>>   descriptors from 2012-09, does that mean that the 2012-09
>>>>>>   descriptors contained in the 2012-10 tarball are not
>>>>>>   contained in the 2012-09 tarball? Or are they duplicates?
>>>>>> - And if they are not duplicates: would it be hard to
>>>>>>   repackage the tarballs? Tedious for sure, but hard? Or
>>>>>>   not good for other reasons?
>>>>>> 
>>>>>> Cheers Thomas
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> metrics-team mailing list
>>>>>> metrics-team at lists.torproject.org 
>>>>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJXYpKqAAoJEC3ESO/4X7XBAFcH/2fTkmtl4GVimbl1QQT6vRsj
ziD0EHyQ68R3iuEpAtNpsV0G0ItEsn+RyPc+OdKCERFNVD+ulRQP8FJzjzH/IlR9
820eYZhBzs8rb7samdYhZvV6s9J1LT8/YqpHBWrV7DUzREt9iBJOqFLcYh0xNXcY
CFKOPyU9oJ2Iq2pn/+E3CKXsSAnuRM91QoVTKyQ2UtI0Lq4iTfPUScnXicUDB2Fw
zBIQwdvLNtzaCuZjrH+a+zostolZ3Wlw9D7emyZth3pS4eq6NcEwxY6LZ8e/Yxv4
xQYq+oTJ5wDKQy1Xh7xcslVQVikMtL09lvmenCczHmaHO3VVSqOT/N5/+54anXM=
=cvXb
-----END PGP SIGNATURE-----
-------------- next part --------------
package wrongmonth;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Iterator;
import java.util.Locale;
import java.util.TimeZone;

import org.torproject.descriptor.BridgeExtraInfoDescriptor;
import org.torproject.descriptor.BridgeNetworkStatus;
import org.torproject.descriptor.BridgePoolAssignment;
import org.torproject.descriptor.BridgeServerDescriptor;
import org.torproject.descriptor.Descriptor;
import org.torproject.descriptor.DescriptorFile;
import org.torproject.descriptor.DescriptorReader;
import org.torproject.descriptor.DescriptorSourceFactory;
import org.torproject.descriptor.ExitList;
import org.torproject.descriptor.RelayDirectory;
import org.torproject.descriptor.RelayExtraInfoDescriptor;
import org.torproject.descriptor.RelayNetworkStatus;
import org.torproject.descriptor.RelayNetworkStatusConsensus;
import org.torproject.descriptor.RelayNetworkStatusVote;
import org.torproject.descriptor.RelayServerDescriptor;
import org.torproject.descriptor.TorperfResult;

public class Main {

  public static void main(String[] args) throws IOException {
    File tarballsDirectory = new File(
        "/Users/karsten/backup/collector-backup");
    File logFile = new File("wrongmonth.log");
    new Main(tarballsDirectory, logFile).parseTarballsDirectory();
  }

  private BufferedWriter bw;

  private File tarballsDirectory;

  DateFormat yearMonthFormat;

  public Main(File tarballsDirectory, File logFile) throws IOException {
    this.bw = new BufferedWriter(new FileWriter(logFile));
    this.tarballsDirectory = tarballsDirectory;
    this.yearMonthFormat = new SimpleDateFormat("yyyy-MM", Locale.US);
    this.yearMonthFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
  }

  private void parseTarballsDirectory() throws IOException {
    for (File tarballFile : this.tarballsDirectory.listFiles()) {
      String tarballFilename = tarballFile.getName();
      if (!tarballFilename.contains("-20")) {
        System.out.printf("Cannot extract month from tarball "
            + "'%s'.\n", tarballFilename);
      } else {
        String tarballMonth = tarballFilename.substring(
            tarballFilename.indexOf("-20") + 1);
        tarballMonth = tarballMonth.substring(0, "yyyy-MM".length());
        System.out.printf("Processing tarball '%s'.\n",
            tarballFilename);
        this.parseDescriptors(tarballFile, tarballMonth);
      }
    }
  }

  private void parseDescriptors(File tarballFile, String tarballMonth)
      throws IOException {
    DescriptorReader descriptorReader =
        DescriptorSourceFactory.createDescriptorReader();
    descriptorReader.addTarball(tarballFile);
    descriptorReader.setMaxDescriptorFilesInQueue(10);
    Iterator<DescriptorFile> descriptorFiles =
          descriptorReader.readDescriptors();
    while (descriptorFiles.hasNext()) {
      DescriptorFile descriptorFile = descriptorFiles.next();
      for (Descriptor descriptor : descriptorFile.getDescriptors()) {
        long publishedMillis = this.extractPublishedMonth(descriptor);
        String publishedMonth = this.yearMonthFormat.format(
            publishedMillis);
        if (!tarballMonth.equals(publishedMonth)) {
          System.out.printf("Tarball '%s' contains file "
              + "'%s' with a descriptor published at '%s'.\n",
              tarballFile.getName(), descriptorFile.getFileName(),
              publishedMonth);
        }
      }
    }
  }

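  /* Return the timestamp in milliseconds that decides which monthly
   * tarball the given descriptor belongs in, or -1 if the descriptor
   * type is not recognized. */
  private long extractPublishedMonth(Descriptor descriptor) {
    long publishedMillis = -1L;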
    if (descriptor instanceof BridgeNetworkStatus) {
      publishedMillis = ((BridgeNetworkStatus) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof BridgeServerDescriptor) {
      publishedMillis = ((BridgeServerDescriptor) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof BridgeExtraInfoDescriptor) {
      publishedMillis = ((BridgeExtraInfoDescriptor) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof BridgePoolAssignment) {
      publishedMillis = ((BridgePoolAssignment) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof RelayNetworkStatusConsensus) {
      publishedMillis = ((RelayNetworkStatusConsensus) descriptor)
          .getValidAfterMillis();
    } else if (descriptor instanceof ExitList) {
      publishedMillis = ((ExitList) descriptor)
          .getDownloadedMillis();
    } else if (descriptor instanceof RelayExtraInfoDescriptor) {
      publishedMillis = ((RelayExtraInfoDescriptor) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof RelayServerDescriptor) {
      publishedMillis = ((RelayServerDescriptor) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof RelayNetworkStatus) {
      publishedMillis = ((RelayNetworkStatus) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof RelayDirectory) {
      /* Cast to RelayDirectory here; casting to BridgePoolAssignment
       * would throw a ClassCastException. */
      publishedMillis = ((RelayDirectory) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof TorperfResult) {
      publishedMillis = ((TorperfResult) descriptor)
          .getStartMillis();
    } else if (descriptor instanceof RelayNetworkStatusVote) {
      publishedMillis = ((RelayNetworkStatusVote) descriptor)
          .getValidAfterMillis();
    }
    return publishedMillis;
  }
}
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Main.java.sig
Type: application/octet-stream
Size: 287 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/metrics-team/attachments/20160616/1c0c8f55/attachment.obj>

