Database schema for metrics portal

Karsten Loesing karsten.loesing at gmx.net
Fri May 21 11:06:12 UTC 2010


Hi Kevin,

here are some thoughts on importing relay descriptors into a database.

You may already have noticed that there are three documents holding
information about a relay in the Tor network: network status entry,
server descriptor, and extra-info descriptor. In the past, I have used
three tables and joined them whenever necessary. It turned out that join
operations are highly inefficient and that it doesn't even make sense to
join extra-info descriptors with network status entries and server
descriptors.

Here's an example. The current network status consensus has this entry
for gabelmoo:

r gabelmoo 8gREE9rC4C49a89HNaGbyh3pcoE gL7MFClJBGG6l3KzCXJPh5fQnl8
2010-05-21 02:34:42 80.190.246.100 8080 8180
s Authority Fast Guard HSDir Running Stable Unnamed V2Dir Valid
v Tor 0.2.2.13-alpha
w Bandwidth=512
p reject 1-65535


The referenced server descriptor gL7MFCl... in base64 or
80becc1429490461ba9772b309724f8797d09e5f in hex has this content:

router gabelmoo 80.190.246.100 8080 0 8180
platform Tor 0.2.2.13-alpha (git-a3db6300225c5f34) on Linux i686
opt protocols Link 1 2 Circuit 1
published 2010-05-21 02:34:42
opt fingerprint F204 4413 DAC2 E02E 3D6B CF47 35A1 9BCA 1DE9 7281
uptime 1154088
bandwidth 512000 1280000 1166966
opt extra-info-digest CECC7F83F5CD771BF0DF9827C08F561CE84A5DA3
opt caches-extra-info
onion-key
-----BEGIN RSA PUBLIC KEY-----
MIGJAoGBANbj/L3FZCbIKNdBKMH1tamU/r+9p+X4fI+1i9jDtHt11D4Rz+rsQ4CX
8no0NMx2yOHpvq7Cp8scO2ZcH4UjQ/bJIumgGqjuslWMn9KtrbBKdtNzVtOMNjYY
7zjZJ8UVmS1/hH2wl2dwZeAVbQ931u9DWX5R/u1ck28EQS8C2Z/jAgMBAAE=
-----END RSA PUBLIC KEY-----
signing-key
-----BEGIN RSA PUBLIC KEY-----
MIGJAoGBAMjPhp0QAN8WQeP8sDg+mnM7hnPExcXTOfsmt7Sl2k3fHcHeAJnYu10V
/hb6RBhubi1HRg6fYF5PwZWOppxRj90WX3n2JEcQh88+4tuQNQ2jOxXQ/hBkyf0w
klrx6Fh8ana1VX+QXInfRW5z3eaANXqvdCvDG5jfUrz6pj8WGPrxAgMBAAE=
-----END RSA PUBLIC KEY-----
opt hidden-service-dir
contact 1024D/F7C11265 Karsten Loesing <karsten dot loesing AT gmx dot net>
reject *:*
router-signature
-----BEGIN SIGNATURE-----
FxYg32MXPM2aOoa5q244CgSs0/EUw4nKJsxde0sD2f3zav7dXdBJBTrv0GdPETUC
iv125vk/zCNNljk2c3bbNAKVnFbBYyBXnJ7/BJL+kSAPYTpQGpIw6RXl3NtWhyZa
08weBwj1bY+xZzYzgB7yuhPvYzH4F/Ji6qLzRsTfJp0=
-----END SIGNATURE-----


And finally, the referenced extra-info descriptor cecc7f83... is:

extra-info gabelmoo F2044413DAC2E02E3D6BCF4735A19BCA1DE97281
published 2010-05-21 02:34:42
write-history 2010-05-21 02:32:48 (900 s)
544401408,540996608,642924544,583867392,653665280,637046784,684775424,598684672,586347520,658676736,630778880,719997952,486724608,622078976,748093440,589464576,539460608,801062912,825155584,805288960,791465984,678842368,636246016,630557696,679609344,666486784,621635584,758945792,654635008,513009664,617194496,636549120,748435456,865181696,736525312,715303936,742599680,690369536,870909952,873070592,690373632,661690368,703024128,689483776,837165056,626258944,741469184,897375232,802021376,669841408,719076352,717147136,819868672,731346944,717606912,574259200,624780288,682221568,708850688,666567680,734041088,725368832,799907840,881100800,896225280,907387904,854701056,934236160,910132224,856087552,894130176,831595520,846332928,831883264,925052928,840542208,701243392,741879808,707265536,858382336,931992576,926849024,923104256,888730624,934275072,929859584,938267648,938893312,929412096,928772096,876532736,828409856,885524480,806822912,656070656,632248320
read-history 2010-05-21 02:32:48 (900 s)
540297216,534199296,628869120,572840960,648410112,631605248,663860224,588089344,580211712,650313728,608541696,701733888,478575616,611846144,726459392,572938240,526454784,786325504,806251520,796734464,787733504,669499392,621448192,617405440,674842624,660013056,599517184,743075840,644049920,506447872,603792384,622928896,742297600,860128256,712641536,694036480,735109120,682270720,852083712,859357184,683282432,656016384,680463360,673383424,831110144,622981120,720413696,891073536,798917632,669814784,705187840,709612544,818158592,729244672,703837184,564427776,625363968,682468352,691994624,660142080,733149184,725949440,785703936,875602944,896345088,909398016,846983168,933785600,917148672,854514688,884723712,822371328,844339200,821991424,913144832,828115968,695621632,737174528,691364864,847837184,934027264,929199104,912046080,877563904,940472320,937701376,935174144,940459008,940481536,938059776,867339264,817305600,884368384,800684032,641873920,621056000
dirreq-stats-end 2010-05-20 17:59:54 (86400 s)
dirreq-v3-ips
us=880,de=752,fr=320,it=224,kr=224,ru=200,gb=176,pl=128,ca=96,ng=88,nl=88,ir=80,jp=80,pk=80,br=72,se=72,at=64,au=56,ch=56,es=56,ph=48,ua=48,mx=40,sn=40,th=40,tr=40,be=32,fi=32,in=32,tw=32,ar=24,bg=24,cz=24,dk=24,eg=24,gh=24,gr=24,ie=24,no=24,ro=24,sg=24,vn=24,a1=16,by=16,ci=16,co=16,dz=16,hk=16,hr=16,hu=16,id=16,il=16,kw=16,kz=16,ma=16,my=16,nz=16,pt=16,sa=16,si=16,sk=16,ve=16,za=16,??=8,a2=8,ae=8,ao=8,aw=8,bd=8,bj=8,bn=8,bo=8,bs=8,cl=8,cr=8,cy=8,do=8,ec=8,ee=8,et=8,eu=8,ge=8,gg=8,gt=8,hn=8,ht=8,iq=8,is=8,jo=8,ke=8,kg=8,lb=8,lt=8,lu=8,lv=8,md=8,mn=8,mt=8,mu=8,nc=8,pa=8,pe=8,pf=8,pr=8,ps=8,qa=8,rs=8,sc=8,sd=8,sm=8,so=8,sv=8,sy=8,tn=8,tt=8,tz=8,uz=8,ye=8
dirreq-v2-ips ng=8
dirreq-v3-reqs
us=1992,de=1296,fr=568,ru=344,kr=304,gb=280,it=256,ng=208,nl=200,se=176,pl=152,at=136,ca=112,br=104,jp=104,ir=88,ph=88,ar=80,es=80,pk=80,ua=80,au=72,ch=72,tw=64,fi=56,be=48,cz=48,th=48,tr=48,bg=40,in=40,mx=40,sn=40,dk=32,gr=32,no=32,ro=32,sg=32,eg=24,gh=24,hk=24,id=24,ie=24,il=24,pt=24,sk=24,vn=24,a1=16,a2=16,ae=16,by=16,ci=16,cl=16,co=16,dz=16,ee=16,hr=16,hu=16,kw=16,kz=16,lb=16,lv=16,ma=16,mn=16,my=16,nz=16,ps=16,sa=16,si=16,tn=16,ve=16,za=16,??=8,ao=8,aw=8,bd=8,bj=8,bn=8,bo=8,bs=8,cr=8,cy=8,do=8,ec=8,et=8,eu=8,ge=8,gg=8,gt=8,hn=8,ht=8,iq=8,is=8,jo=8,ke=8,kg=8,lt=8,lu=8,md=8,mt=8,mu=8,nc=8,pa=8,pe=8,pf=8,pr=8,qa=8,rs=8,sc=8,sd=8,sm=8,so=8,sv=8,sy=8,tt=8,tz=8,uz=8,ye=8
dirreq-v2-reqs ng=8
dirreq-v3-resp
ok=8000,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=0,busy=0
dirreq-v2-resp ok=8,unavailable=0,not-found=0,not-modified=0,busy=0
dirreq-v2-share 0.00%
dirreq-v3-share 0.00%
dirreq-v3-direct-dl
complete=3924,timeout=24,running=0,min=529,d1=49744,d2=84867,q1=103848,d3=119618,d4=165491,md=215631,d6=293536,d7=428605,q3=500448,d8=656839,d9=1507344,max=25818600
dirreq-v2-direct-dl complete=4,timeout=0,running=0
dirreq-v3-tunneled-dl
complete=3968,timeout=88,running=4,min=161,d1=24326,d2=51796,q1=61586,d3=71895,d4=91512,md=121749,d6=161907,d7=231554,q3=285958,d8=343692,d9=529069,max=25762200
dirreq-v2-tunneled-dl complete=0,timeout=0,running=0
entry-stats-end 2010-05-20 17:59:54 (86400 s)
entry-ips
de=1832,us=1680,kr=912,fr=680,it=672,ru=416,gb=384,pl=352,ir=336,jp=264,ng=240,ca=216,br=208,es=168,in=152,pk=144,at=128,au=128,nl=128,ph=128,se=128,tr=128,ua=112,ch=104,th=104,mx=96,tw=80,cz=72,ar=64,fi=64,sn=64,be=56,hu=56,ie=56,il=56,tn=56,vn=56,dk=48,gh=48,gr=48,id=48,kw=48,ma=48,no=48,ro=48,sa=48,??=40,bg=40,ci=40,eg=40,hk=40,my=40,pt=40,za=40,co=32,dz=32,nz=32,sg=32,a1=24,by=24,cl=24,kz=24,sk=24,ve=24,a2=16,bd=16,bj=16,hr=16,jo=16,ke=16,lb=16,lt=16,lu=16,pe=16,ps=16,rs=16,sd=16,si=16,sy=16,ye=16,ae=8,af=8,al=8,am=8,ao=8,aw=8,az=8,ba=8,bb=8,bh=8,bn=8,bo=8,bs=8,cm=8,cn=8,cr=8,cy=8,dj=8,do=8,ec=8,ee=8,eu=8,fj=8,fo=8,ge=8,gg=8,gt=8,gy=8,hn=8,ht=8,iq=8,is=8,jm=8,kg=8,ky=8,lk=8,lv=8,ly=8,mc=8,md=8,me=8,mg=8,mk=8,ml=8,mo=8,mr=8,mt=8,mu=8,mv=8,nc=8,ni=8,np=8,pa=8,pf=8,pr=8,py=8,qa=8,sc=8,so=8,sv=8,tg=8,tt=8,tz=8,ug=8,uy=8,uz=8
cell-stats-end 2010-05-20 17:59:54 (86400 s)
cell-processed-cells 9890,208,69,27,16,7,5,5,4,2
cell-queued-cells 2.63,0.77,0.02,0.00,0.00,0.00,0.00,0.00,0.00,0.00
cell-time-in-queue 911,505,39,11,15,9,9,20,46,197
cell-circuits-per-decile 11482
exit-stats-end 2010-05-20 17:59:54 (86400 s)
exit-kibibytes-written other=0
exit-kibibytes-read other=0
exit-streams-opened other=0
router-signature
-----BEGIN SIGNATURE-----
N1CbWArG7rBEXa0r6dprMSh/HkdcpjtRyG0zScAJAYp76j4Nm/bIZz8UbDTwauqN
G9ysHp+qUf4IAuSDrYeLrSo5mrHuQOEDfEfA04eJ9bS2QzTFYZeO6kEFXlSprnc3
IC23R/isVEpDqM2qaWKnBIo3mT7bKram7GO1weIDfR0=
-----END SIGNATURE-----


One thing you can see here is that all fields in extra-info descriptors
have their own timestamps. None of the data in extra-info descriptors
requires knowledge of the corresponding server descriptor or network
status entry. If you want to analyze data from the extra-info
descriptors, you can import them into a table of its own.

So, what about joining network status entries and server descriptors?
There are statistics when it makes sense to join these documents. For
example, you might want to evaluate assigned relay flags (Stable, Guard;
contained in network status entry) and uptime (contained in server
descriptor) at the same time. In particular, it's almost never useful to
base statistics solely on server descriptors, because relays are not
even confirmed to be reachable when you only have a server descriptor.

How do you avoid join operations in your queries? I'd think that having
2 import tables for network status entries and server descriptors and 1
unnormalized table containing all data for aggregation/querying might be
a good approach. You might be able to update that 1 table by using
triggers on the other 2 tables. You'll have lots of redundant data in
your database: all fields in your server descriptors can be contained in
up to 24 rows, as the maximum amount of time a server descriptor is
valid is 24 hours. But that's probably fine, as there's not so much
stuff in server descriptors that's useful for statistics anyway. And
you'll be excited how fast queries are without having to join tables. :)


Also, as discussed on #tor-dev, below are some notes on creating a
database schema for the BIRT-based metrics portal we had a couple of
months ago (and only for a very short time). I don't think you can
re-use everything as-is, but maybe you find some of the ideas useful for
your database schema.

Best,
--Karsten


Data table
----------

The data table only includes entries from network statuses, not from
server descriptors or extra-info descriptors.

  CREATE TABLE statusentry (
    validafter TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    fingerprint CHARACTER(40),
    hashed_fingerprint CHARACTER(40) NOT NULL,
    bridge BOOLEAN NOT NULL DEFAULT false,
    descriptor CHARACTER(40) NOT NULL,
    authority BOOLEAN NOT NULL DEFAULT false,
    badexit BOOLEAN NOT NULL DEFAULT false,
    baddirectory BOOLEAN NOT NULL DEFAULT false,
    exit BOOLEAN NOT NULL DEFAULT false,
    fast BOOLEAN NOT NULL DEFAULT false,
    guard BOOLEAN NOT NULL DEFAULT false,
    hsdir BOOLEAN NOT NULL DEFAULT false,
    named BOOLEAN NOT NULL DEFAULT false,
    stable BOOLEAN NOT NULL DEFAULT false,
    running BOOLEAN NOT NULL DEFAULT false,
    unnamed BOOLEAN NOT NULL DEFAULT false,
    valid BOOLEAN NOT NULL DEFAULT false,
    v2dir BOOLEAN NOT NULL DEFAULT false,
    v3dir BOOLEAN NOT NULL DEFAULT false,
    bandwidthmeasured BIGINT CHECK (bandwidthmeasured >= 0),
    country CHARACTER(2),
    PRIMARY KEY (validafter, hashed_fingerprint, bridge));


Network size
------------

The network size is determined as the mean number of running relays or
bridges on a given day. These numbers are calculated from the network
status by counting the number of running relays or bridges over the day
and dividing them by the number of network statuses per that day:

  CREATE VIEW network_size_v AS
    SELECT DATE(validafter),
      SUM(CASE WHEN bridge IS FALSE THEN 1 ELSE 0 END) /
        relay_statuses_per_day.count AS avg_relays,
      SUM(CASE WHEN bridge IS TRUE THEN 1 ELSE 0 END) /
        bridge_statuses_per_day.count AS avg_bridges
    FROM statusentry,
      (SELECT COUNT(*) AS count, DATE(validafter) AS date
        FROM (SELECT DISTINCT validafter FROM statusentry
          WHERE bridge IS FALSE)
        distinct_consensuses GROUP BY DATE(validafter))
        relay_statuses_per_day,
      (SELECT COUNT(*) AS count, DATE(validafter) AS date
        FROM (SELECT DISTINCT validafter FROM statusentry
          WHERE bridge IS TRUE)
        distinct_consensuses GROUP BY DATE(validafter))
        bridge_statuses_per_day
    WHERE statusentry.running IS TRUE
      AND DATE(validafter) = relay_statuses_per_day.date
      AND DATE(validafter) = bridge_statuses_per_day.date
    GROUP BY DATE(validafter), relay_statuses_per_day.count,
      bridge_statuses_per_day.count
    ORDER BY DATE(validafter);

The resulting view is 'materialized' by creating a table with the
contents of the view. Finally, the query role gets select rights for the
materialized view:

  CREATE TABLE network_size_mv as SELECT * FROM network_size_v;

  GRANT SELECT ON network_size_mv TO query;


Platforms and relay versions
----------------------------

The reports on platforms and relay versions are based on network status
entries in combination with relay descriptors. The information whether a
relay was running and contained in a network status comes from the
statusentry relation, whereas the simplified platform string comes from
the descriptor relation. For each date, the number of running relays are
counted, grouped by platform string, and divided by the number of stored
network statuses at that day:

  CREATE VIEW platforms_v AS
    SELECT DATE(validafter) AS date, platform,
    COUNT(*) / relay_statuses_per_day.count AS count
    FROM (
      SELECT COUNT(*) AS count, DATE(validafter) AS date
      FROM (
        SELECT DISTINCT validafter
        FROM statusentry
        WHERE bridge IS FALSE) distinct_consensuses
      GROUP BY DATE(validafter)) relay_statuses_per_day
    JOIN statusentry ON relay_statuses_per_day.date = DATE(validafter)
    LEFT JOIN descriptor
      ON statusentry.descriptor = descriptor.descriptor
    WHERE running IS TRUE AND bridge IS FALSE
    GROUP BY DATE(validafter), platform, relay_statuses_per_day.count,
      relay_statuses_per_day.date
    ORDER BY DATE(validafter), platform;

The mean numbers of relays per day running a certain Tor version are
determined similarly:

  CREATE VIEW versions_v AS
    SELECT DATE(validafter) AS date,
      SUBSTRING(version, 1, 5) AS version,
      COUNT(*) / relay_statuses_per_day.count AS count
    FROM (
      SELECT COUNT(*) AS count, DATE(validafter) AS date
      FROM (
        SELECT DISTINCT validafter
        FROM statusentry
        WHERE bridge IS FALSE) distinct_consensuses
      GROUP BY DATE(validafter)) relay_statuses_per_day
    JOIN statusentry ON relay_statuses_per_day.date = DATE(validafter)
    LEFT JOIN descriptor
      ON statusentry.descriptor = descriptor.descriptor
    WHERE running IS TRUE AND bridge IS FALSE
    GROUP BY DATE(validafter), SUBSTRING(version, 1, 5),
      relay_statuses_per_day.count, relay_statuses_per_day.date
    ORDER BY DATE(validafter), SUBSTRING(version, 1, 5);

Finally, two tables are created with the contents of these two views in
order to accelerate querying, and select rights are granted to the query
role:

  CREATE TABLE platforms_mv as SELECT * FROM platforms_v;

  CREATE TABLE versions_mv as SELECT * FROM versions_v;

  GRANT SELECT ON platforms_mv TO query;

  GRANT SELECT ON versions_mv TO query;


Bridge churn
------------

Bridge churn is visualized as time plot with four lines showing how many
of the running bridges at a given time are still running 1
day/week/month later. The fewer bridges are missing after those times,
the more stable the bridge population is.

The necessary query table for this report is constructed in multiple
steps. To begin with, the set of bridges in network statuses is reduced
to the first network status of each day:

  CREATE VIEW bridges_day_v AS
    SELECT DATE(validafter), hashed_fingerprint
    FROM statusentry
    WHERE validafter IN (
      SELECT MIN(validafter)
      FROM statusentry
      WHERE bridge IS true AND running IS true
      GROUP BY DATE(validafter))
    AND bridge IS true AND running IS true;

The resulting view is materialized, as it is used by multiple other views:

  CREATE TABLE bridges_day_mv AS SELECT * FROM bridges_day_v;

In the next step, four views are created that determine how many bridges
overlap in the sets of a given date and a date 1 day/week/month in the
future from then:

  CREATE VIEW bridges_churn_now_v AS
    SELECT date, COUNT(*)
    FROM bridges_day_mv
    GROUP BY date;

  CREATE VIEW bridges_churn_day_v AS
    SELECT now.date, COUNT(*)
    FROM bridges_day_mv now, bridges_day_mv future
    WHERE now.date = DATE(future.date + interval '1 day')
      AND now.hashed_fingerprint = future.hashed_fingerprint
    GROUP BY now.date, future.date;

  CREATE VIEW bridges_churn_week_v AS
    SELECT now.date, COUNT(*)
    FROM bridges_day_mv now, bridges_day_mv future
    WHERE now.date = date(future.date + interval '1 week')
      AND now.hashed_fingerprint = future.hashed_fingerprint
    GROUP BY now.date, future.date;

  CREATE VIEW bridges_churn_month_v AS
    SELECT now.date, COUNT(*)
    FROM bridges_day_mv now, bridges_day_mv future
    WHERE now.date = date(future.date + interval '1 month')
      AND now.hashed_fingerprint = future.hashed_fingerprint
    GROUP BY now.date, future.date;

In the last step, these four views are joined into one common view:

  CREATE VIEW bridges_churn_v AS
    SELECT bridges_churn_now_v.date, bridges_churn_now_v.count AS now,
      bridges_churn_day_v.count AS day,
      bridges_churn_week_v.count AS week,
      bridges_churn_month_v.count AS month
    FROM bridges_churn_now_v
      LEFT JOIN bridges_churn_day_v
        ON bridges_churn_now_v.date = bridges_churn_day_v.date
      LEFT JOIN bridges_churn_week_v
        ON bridges_churn_day_v.date = bridges_churn_week_v.date
      LEFT JOIN bridges_churn_month_v
        ON bridges_churn_week_v.date = bridges_churn_month_v.date
      ORDER BY bridges_churn_now_v.date;

In order to improve query performance, a materialized view is created
based on the last view and is given select rights to the query role:

  CREATE TABLE bridges_churn_mv AS SELECT * FROM bridges_churn_v;

  GRANT SELECT ON bridges_churn_mv TO query;


Bridge churn histogram
----------------------

Bridge churn can be visualized as histogram with bars representing the
number of bridges that have been seen as running for a certain number of
days throughout 1 month.  The higher the bars for high numbers of days,
the more stable the bridge population is.

The underlying data set for this report is built from a sequence of
views, starting with a view on the bridges seen on the first days of a
month:

  CREATE VIEW bridges_hist_first_day_v AS
    SELECT DATE_TRUNC('month', validafter) AS first_day,
      hashed_fingerprint
    FROM statusentry
    WHERE DATE(validafter) IN (
      SELECT DATE(MIN(validafter))
      FROM statusentry
      WHERE bridge IS true AND running IS true
      GROUP BY DATE_TRUNC('month', validafter))
    AND bridge IS true AND running IS true
    GROUP BY DATE_TRUNC('month', validafter), hashed_fingerprint;

In the next step, all days are selected on which these bridges have been
seen as running:

  CREATE VIEW bridges_hist_seen_v AS
    SELECT DATE(statusentry.validafter), statusentry.hashed_fingerprint
    FROM statusentry, bridges_hist_first_day_v
    WHERE DATE_TRUNC('month', statusentry.validafter) =
        bridges_hist_first_day_v.first_day
      AND statusentry.hashed_fingerprint =
        bridges_hist_first_day_v.hashed_fingerprint
    GROUP BY DATE(statusentry.validafter), statusentry.hashed_fingerprint;

For each distinct bridge in each month, the number of days that this
bridge has been seen as running are counted:

  CREATE VIEW bridges_hist_pre_v AS
    SELECT COUNT(*) AS days, DATE_TRUNC('month', date) AS month
    FROM bridges_hist_seen_v
    GROUP BY DATE_TRUNC('month', date), hashed_fingerprint;

Finally, the number of bridges for a given number of days are added up:

  CREATE VIEW bridges_hist_v AS
    SELECT COUNT(*), days, DATE(month) AS month
    FROM bridges_hist_pre_v
    GROUP BY month, days
    ORDER BY month, days;

A materialized view is created and given the select rights for the query
role:

  CREATE TABLE bridges_hist_mv AS SELECT * FROM bridges_hist_v;

  GRANT SELECT ON bridges_hist_mv TO query;



More information about the tor-dev mailing list