Karsten,<div><br></div><div>This is all extremely helpful. I certainly won&#39;t be able to test with the full amount of data, but we can get a look at the performance implications of running queries on both normalized and unnormalized schemas. </div>


<div><br></div><div>As far as a timeline - I think 2 weeks is a fair amount of time for the schema to be finished. Should I finish early (which has a good chance of happening), I can begin working on having R work with the database directly, or making/modifying a descriptor database importer for this schema. I&#39;m really looking to get the db stuff down for a basis from which we can build off. We can discuss this when the time comes. </div>

<div><br></div><div>Thanks,</div><div>Kevin</div><div><br></div><div><br><div class="gmail_quote">On 21 May 2010 07:06, Karsten Loesing <span dir="ltr">&lt;<a href="mailto:karsten.loesing@gmx.net" target="_blank">karsten.loesing@gmx.net</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Kevin,<br>

<br>

here are some thoughts on importing relay descriptors into a database.<br>

<br>

You may already have noticed that there are three documents holding<br>

information about a relay in the Tor network: network status entry,<br>

server descriptor, and extra-info descriptor. In the past, I have used<br>

three tables and joined them whenever necessary. It turned out that join<br>

operations are highly inefficient and that it doesn&#39;t even make sense to<br>

join extra-info descriptors with network status entries and server<br>

descriptors.<br>

<br>

Here&#39;s an example. The current network status consensus has this entry<br>

for gabelmoo:<br>

<br>

r gabelmoo 8gREE9rC4C49a89HNaGbyh3pcoE gL7MFClJBGG6l3KzCXJPh5fQnl8<br>

2010-05-21 02:34:42 80.190.246.100 8080 8180<br>

s Authority Fast Guard HSDir Running Stable Unnamed V2Dir Valid<br>

v Tor 0.2.2.13-alpha<br>

w Bandwidth=512<br>

p reject 1-65535<br>

<br>

<br>

The referenced server descriptor gL7MFCl... in base64 or<br>

80becc1429490461ba9772b309724f8797d09e5f in hex has this content:<br>

<br>

router gabelmoo 80.190.246.100 8080 0 8180<br>

platform Tor 0.2.2.13-alpha (git-a3db6300225c5f34) on Linux i686<br>

opt protocols Link 1 2 Circuit 1<br>

published 2010-05-21 02:34:42<br>

opt fingerprint F204 4413 DAC2 E02E 3D6B CF47 35A1 9BCA 1DE9 7281<br>

uptime 1154088<br>

bandwidth 512000 1280000 1166966<br>

opt extra-info-digest CECC7F83F5CD771BF0DF9827C08F561CE84A5DA3<br>

opt caches-extra-info<br>

onion-key<br>

-----BEGIN RSA PUBLIC KEY-----<br>

MIGJAoGBANbj/L3FZCbIKNdBKMH1tamU/r+9p+X4fI+1i9jDtHt11D4Rz+rsQ4CX<br>

8no0NMx2yOHpvq7Cp8scO2ZcH4UjQ/bJIumgGqjuslWMn9KtrbBKdtNzVtOMNjYY<br>

7zjZJ8UVmS1/hH2wl2dwZeAVbQ931u9DWX5R/u1ck28EQS8C2Z/jAgMBAAE=<br>

-----END RSA PUBLIC KEY-----<br>

signing-key<br>

-----BEGIN RSA PUBLIC KEY-----<br>

MIGJAoGBAMjPhp0QAN8WQeP8sDg+mnM7hnPExcXTOfsmt7Sl2k3fHcHeAJnYu10V<br>

/hb6RBhubi1HRg6fYF5PwZWOppxRj90WX3n2JEcQh88+4tuQNQ2jOxXQ/hBkyf0w<br>

klrx6Fh8ana1VX+QXInfRW5z3eaANXqvdCvDG5jfUrz6pj8WGPrxAgMBAAE=<br>

-----END RSA PUBLIC KEY-----<br>

opt hidden-service-dir<br>

contact 1024D/F7C11265 Karsten Loesing &lt;karsten dot loesing AT gmx dot net&gt;<br>

reject *:*<br>

router-signature<br>

-----BEGIN SIGNATURE-----<br>

FxYg32MXPM2aOoa5q244CgSs0/EUw4nKJsxde0sD2f3zav7dXdBJBTrv0GdPETUC<br>

iv125vk/zCNNljk2c3bbNAKVnFbBYyBXnJ7/BJL+kSAPYTpQGpIw6RXl3NtWhyZa<br>

08weBwj1bY+xZzYzgB7yuhPvYzH4F/Ji6qLzRsTfJp0=<br>

-----END SIGNATURE-----<br>

<br>

<br>

And finally, the referenced extra-info descriptor cecc7f83... is:<br>

<br>

extra-info gabelmoo F2044413DAC2E02E3D6BCF4735A19BCA1DE97281<br>

published 2010-05-21 02:34:42<br>

write-history 2010-05-21 02:32:48 (900 s)<br>

544401408,540996608,642924544,583867392,653665280,637046784,684775424,598684672,586347520,658676736,630778880,719997952,486724608,622078976,748093440,589464576,539460608,801062912,825155584,805288960,791465984,678842368,636246016,630557696,679609344,666486784,621635584,758945792,654635008,513009664,617194496,636549120,748435456,865181696,736525312,715303936,742599680,690369536,870909952,873070592,690373632,661690368,703024128,689483776,837165056,626258944,741469184,897375232,802021376,669841408,719076352,717147136,819868672,731346944,717606912,574259200,624780288,682221568,708850688,666567680,734041088,725368832,799907840,881100800,896225280,907387904,854701056,934236160,910132224,856087552,894130176,831595520,846332928,831883264,925052928,840542208,701243392,741879808,707265536,858382336,931992576,926849024,923104256,888730624,934275072,929859584,938267648,938893312,929412096,928772096,876532736,828409856,885524480,806822912,656070656,632248320<br>


read-history 2010-05-21 02:32:48 (900 s)<br>

540297216,534199296,628869120,572840960,648410112,631605248,663860224,588089344,580211712,650313728,608541696,701733888,478575616,611846144,726459392,572938240,526454784,786325504,806251520,796734464,787733504,669499392,621448192,617405440,674842624,660013056,599517184,743075840,644049920,506447872,603792384,622928896,742297600,860128256,712641536,694036480,735109120,682270720,852083712,859357184,683282432,656016384,680463360,673383424,831110144,622981120,720413696,891073536,798917632,669814784,705187840,709612544,818158592,729244672,703837184,564427776,625363968,682468352,691994624,660142080,733149184,725949440,785703936,875602944,896345088,909398016,846983168,933785600,917148672,854514688,884723712,822371328,844339200,821991424,913144832,828115968,695621632,737174528,691364864,847837184,934027264,929199104,912046080,877563904,940472320,937701376,935174144,940459008,940481536,938059776,867339264,817305600,884368384,800684032,641873920,621056000<br>


dirreq-stats-end 2010-05-20 17:59:54 (86400 s)<br>

dirreq-v3-ips<br>

us=880,de=752,fr=320,it=224,kr=224,ru=200,gb=176,pl=128,ca=96,ng=88,nl=88,ir=80,jp=80,pk=80,br=72,se=72,at=64,au=56,ch=56,es=56,ph=48,ua=48,mx=40,sn=40,th=40,tr=40,be=32,fi=32,in=32,tw=32,ar=24,bg=24,cz=24,dk=24,eg=24,gh=24,gr=24,ie=24,no=24,ro=24,sg=24,vn=24,a1=16,by=16,ci=16,co=16,dz=16,hk=16,hr=16,hu=16,id=16,il=16,kw=16,kz=16,ma=16,my=16,nz=16,pt=16,sa=16,si=16,sk=16,ve=16,za=16,??=8,a2=8,ae=8,ao=8,aw=8,bd=8,bj=8,bn=8,bo=8,bs=8,cl=8,cr=8,cy=8,do=8,ec=8,ee=8,et=8,eu=8,ge=8,gg=8,gt=8,hn=8,ht=8,iq=8,is=8,jo=8,ke=8,kg=8,lb=8,lt=8,lu=8,lv=8,md=8,mn=8,mt=8,mu=8,nc=8,pa=8,pe=8,pf=8,pr=8,ps=8,qa=8,rs=8,sc=8,sd=8,sm=8,so=8,sv=8,sy=8,tn=8,tt=8,tz=8,uz=8,ye=8<br>


dirreq-v2-ips ng=8<br>

dirreq-v3-reqs<br>

us=1992,de=1296,fr=568,ru=344,kr=304,gb=280,it=256,ng=208,nl=200,se=176,pl=152,at=136,ca=112,br=104,jp=104,ir=88,ph=88,ar=80,es=80,pk=80,ua=80,au=72,ch=72,tw=64,fi=56,be=48,cz=48,th=48,tr=48,bg=40,in=40,mx=40,sn=40,dk=32,gr=32,no=32,ro=32,sg=32,eg=24,gh=24,hk=24,id=24,ie=24,il=24,pt=24,sk=24,vn=24,a1=16,a2=16,ae=16,by=16,ci=16,cl=16,co=16,dz=16,ee=16,hr=16,hu=16,kw=16,kz=16,lb=16,lv=16,ma=16,mn=16,my=16,nz=16,ps=16,sa=16,si=16,tn=16,ve=16,za=16,??=8,ao=8,aw=8,bd=8,bj=8,bn=8,bo=8,bs=8,cr=8,cy=8,do=8,ec=8,et=8,eu=8,ge=8,gg=8,gt=8,hn=8,ht=8,iq=8,is=8,jo=8,ke=8,kg=8,lt=8,lu=8,md=8,mt=8,mu=8,nc=8,pa=8,pe=8,pf=8,pr=8,qa=8,rs=8,sc=8,sd=8,sm=8,so=8,sv=8,sy=8,tt=8,tz=8,uz=8,ye=8<br>


dirreq-v2-reqs ng=8<br>

dirreq-v3-resp<br>

ok=8000,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=0,busy=0<br>

dirreq-v2-resp ok=8,unavailable=0,not-found=0,not-modified=0,busy=0<br>

dirreq-v2-share 0.00%<br>

dirreq-v3-share 0.00%<br>

dirreq-v3-direct-dl<br>

complete=3924,timeout=24,running=0,min=529,d1=49744,d2=84867,q1=103848,d3=119618,d4=165491,md=215631,d6=293536,d7=428605,q3=500448,d8=656839,d9=1507344,max=25818600<br>

dirreq-v2-direct-dl complete=4,timeout=0,running=0<br>

dirreq-v3-tunneled-dl<br>

complete=3968,timeout=88,running=4,min=161,d1=24326,d2=51796,q1=61586,d3=71895,d4=91512,md=121749,d6=161907,d7=231554,q3=285958,d8=343692,d9=529069,max=25762200<br>

dirreq-v2-tunneled-dl complete=0,timeout=0,running=0<br>

entry-stats-end 2010-05-20 17:59:54 (86400 s)<br>

entry-ips<br>

de=1832,us=1680,kr=912,fr=680,it=672,ru=416,gb=384,pl=352,ir=336,jp=264,ng=240,ca=216,br=208,es=168,in=152,pk=144,at=128,au=128,nl=128,ph=128,se=128,tr=128,ua=112,ch=104,th=104,mx=96,tw=80,cz=72,ar=64,fi=64,sn=64,be=56,hu=56,ie=56,il=56,tn=56,vn=56,dk=48,gh=48,gr=48,id=48,kw=48,ma=48,no=48,ro=48,sa=48,??=40,bg=40,ci=40,eg=40,hk=40,my=40,pt=40,za=40,co=32,dz=32,nz=32,sg=32,a1=24,by=24,cl=24,kz=24,sk=24,ve=24,a2=16,bd=16,bj=16,hr=16,jo=16,ke=16,lb=16,lt=16,lu=16,pe=16,ps=16,rs=16,sd=16,si=16,sy=16,ye=16,ae=8,af=8,al=8,am=8,ao=8,aw=8,az=8,ba=8,bb=8,bh=8,bn=8,bo=8,bs=8,cm=8,cn=8,cr=8,cy=8,dj=8,do=8,ec=8,ee=8,eu=8,fj=8,fo=8,ge=8,gg=8,gt=8,gy=8,hn=8,ht=8,iq=8,is=8,jm=8,kg=8,ky=8,lk=8,lv=8,ly=8,mc=8,md=8,me=8,mg=8,mk=8,ml=8,mo=8,mr=8,mt=8,mu=8,mv=8,nc=8,ni=8,np=8,pa=8,pf=8,pr=8,py=8,qa=8,sc=8,so=8,sv=8,tg=8,tt=8,tz=8,ug=8,uy=8,uz=8<br>


cell-stats-end 2010-05-20 17:59:54 (86400 s)<br>

cell-processed-cells 9890,208,69,27,16,7,5,5,4,2<br>

cell-queued-cells 2.63,0.77,0.02,0.00,0.00,0.00,0.00,0.00,0.00,0.00<br>

cell-time-in-queue 911,505,39,11,15,9,9,20,46,197<br>

cell-circuits-per-decile 11482<br>

exit-stats-end 2010-05-20 17:59:54 (86400 s)<br>

exit-kibibytes-written other=0<br>

exit-kibibytes-read other=0<br>

exit-streams-opened other=0<br>

router-signature<br>

-----BEGIN SIGNATURE-----<br>

N1CbWArG7rBEXa0r6dprMSh/HkdcpjtRyG0zScAJAYp76j4Nm/bIZz8UbDTwauqN<br>

G9ysHp+qUf4IAuSDrYeLrSo5mrHuQOEDfEfA04eJ9bS2QzTFYZeO6kEFXlSprnc3<br>

IC23R/isVEpDqM2qaWKnBIo3mT7bKram7GO1weIDfR0=<br>

-----END SIGNATURE-----<br>

<br>

<br>

One thing you can see here is that all fields in extra-info descriptors<br>

have their own timestamps. None of the data in extra-info descriptors<br>

requires knowledge of the corresponding server descriptor or network<br>

status entry. If you want to analyze data from the extra-info<br>

descriptors, you can import them into a table of its own.<br>

<br>

So, what about joining network status entries and server descriptors?<br>

There are statistics when it makes sense to join these documents. For<br>

example, you might want to evaluate assigned relay flags (Stable, Guard;<br>

contained in network status entry) and uptime (contained in server<br>

descriptor) at the same time. In particular, it&#39;s almost never useful to<br>

base statistics solely on server descriptors, because relays are not<br>

even confirmed to be reachable when you only have a server descriptor.<br>

<br>

How do you avoid join operations in your queries? I&#39;d think that having<br>

2 import tables for network status entries and server descriptors and 1<br>

unnormalized table containing all data for aggregation/querying might be<br>

a good approach. You might be able to update that 1 table by using<br>

triggers on the other 2 tables. You&#39;ll have lots of redundant data in<br>

your database: all fields in your server descriptors can be contained in<br>

up to 24 rows, as the maximum amount of time a server descriptor is<br>

valid is 24 hours. But that&#39;s probably fine, as there&#39;s not so much<br>

stuff in server descriptors that&#39;s useful for statistics anyway. And<br>

you&#39;ll be excited how fast queries are without having to join tables. :)<br>

<br>

<br>

Also, as discussed on #tor-dev, below are some notes on creating a<br>

database schema for the BIRT-based metrics portal we had a couple of<br>

months ago (and only for a very short time). I don&#39;t think you can<br>

re-use everything as-is, but maybe you find some of the ideas useful for<br>

your database schema.<br>

<br>

Best,<br>

--Karsten<br>

<br>

<br>

Data table<br>

----------<br>

<br>

The data table only includes entries from network statuses, not from<br>

server descriptors or extra-info descriptors.<br>

<br>

  CREATE TABLE statusentry (<br>

    validafter TIMESTAMP WITHOUT TIME ZONE NOT NULL,<br>

    fingerprint CHARACTER(40),<br>

    hashed_fingerprint CHARACTER(40) NOT NULL,<br>

    bridge BOOLEAN NOT NULL DEFAULT false,<br>

    descriptor CHARACTER(40) NOT NULL,<br>

    authority BOOLEAN NOT NULL DEFAULT false,<br>

    badexit BOOLEAN NOT NULL DEFAULT false,<br>

    baddirectory BOOLEAN NOT NULL DEFAULT false,<br>

    exit BOOLEAN NOT NULL DEFAULT false,<br>

    fast BOOLEAN NOT NULL DEFAULT false,<br>

    guard BOOLEAN NOT NULL DEFAULT false,<br>

    hsdir BOOLEAN NOT NULL DEFAULT false,<br>

    named BOOLEAN NOT NULL DEFAULT false,<br>

    stable BOOLEAN NOT NULL DEFAULT false,<br>

    running BOOLEAN NOT NULL DEFAULT false,<br>

    unnamed BOOLEAN NOT NULL DEFAULT false,<br>

    valid BOOLEAN NOT NULL DEFAULT false,<br>

    v2dir BOOLEAN NOT NULL DEFAULT false,<br>

    v3dir BOOLEAN NOT NULL DEFAULT false,<br>

    bandwidthmeasured BIGINT CHECK (bandwidthmeasured &gt;= 0),<br>

    country CHARACTER(2),<br>

    PRIMARY KEY (validafter, hashed_fingerprint, bridge));<br>

<br>

<br>

Network size<br>

------------<br>

<br>

The network size is determined as the mean number of running relays or<br>

bridges on a given day. These numbers are calculated from the network<br>

status by counting the number of running relays or bridges over the day<br>

and dividing them by the number of network statuses per that day:<br>

<br>

  CREATE VIEW network_size_v AS<br>

    SELECT DATE(validafter),<br>

      SUM(CASE WHEN bridge IS FALSE THEN 1 ELSE 0 END) /<br>

        relay_statuses_per_day.count AS avg_relays,<br>

      SUM(CASE WHEN bridge IS TRUE THEN 1 ELSE 0 END) /<br>

        bridge_statuses_per_day.count AS avg_bridges<br>

    FROM statusentry,<br>

      (SELECT COUNT(*) AS count, DATE(validafter) AS date<br>

        FROM (SELECT DISTINCT validafter FROM statusentry<br>

          WHERE bridge IS FALSE)<br>

        distinct_consensuses GROUP BY DATE(validafter))<br>

        relay_statuses_per_day,<br>

      (SELECT COUNT(*) AS count, DATE(validafter) AS date<br>

        FROM (SELECT DISTINCT validafter FROM statusentry<br>

          WHERE bridge IS TRUE)<br>

        distinct_consensuses GROUP BY DATE(validafter))<br>

        bridge_statuses_per_day<br>

    WHERE statusentry.running IS TRUE<br>

      AND DATE(validafter) = relay_statuses_per_day.date<br>

      AND DATE(validafter) = bridge_statuses_per_day.date<br>

    GROUP BY DATE(validafter), relay_statuses_per_day.count,<br>

      bridge_statuses_per_day.count<br>

    ORDER BY DATE(validafter);<br>

<br>

The resulting view is &#39;materialized&#39; by creating a table with the<br>

contents of the view. Finally, the query role gets select rights for the<br>

materialized view:<br>

<br>

  CREATE TABLE network_size_mv as SELECT * FROM network_size_v;<br>

<br>

  GRANT SELECT ON network_size_mv TO query;<br>

<br>

<br>

Platforms and relay versions<br>

----------------------------<br>

<br>

The reports on platforms and relay versions are based on network status<br>

entries in combination with relay descriptors. The information whether a<br>

relay was running and contained in a network status comes from the<br>

statusentry relation, whereas the simplified platform string comes from<br>

the descriptor relation. For each date, the number of running relays are<br>

counted, grouped by platform string, and divided by the number of stored<br>

network statuses at that day:<br>

<br>

  CREATE VIEW platforms_v AS<br>

    SELECT DATE(validafter) AS date, platform,<br>

    COUNT(*) / relay_statuses_per_day.count AS count<br>

    FROM (<br>

      SELECT COUNT(*) AS count, DATE(validafter) AS date<br>

      FROM (<br>

        SELECT DISTINCT validafter<br>

        FROM statusentry<br>

        WHERE bridge IS FALSE) distinct_consensuses<br>

      GROUP BY DATE(validafter)) relay_statuses_per_day<br>

    JOIN statusentry ON relay_statuses_per_day.date = DATE(validafter)<br>

    LEFT JOIN descriptor<br>

      ON statusentry.descriptor = descriptor.descriptor<br>

    WHERE running IS TRUE AND bridge IS FALSE<br>

    GROUP BY DATE(validafter), platform, relay_statuses_per_day.count,<br>

      relay_statuses_per_day.date<br>

    ORDER BY DATE(validafter), platform;<br>

<br>

The mean numbers of relays per day running a certain Tor version are<br>

determined similarly:<br>

<br>

  CREATE VIEW versions_v AS<br>

    SELECT DATE(validafter) AS date,<br>

      SUBSTRING(version, 1, 5) AS version,<br>

      COUNT(*) / relay_statuses_per_day.count AS count<br>

    FROM (<br>

      SELECT COUNT(*) AS count, DATE(validafter) AS date<br>

      FROM (<br>

        SELECT DISTINCT validafter<br>

        FROM statusentry<br>

        WHERE bridge IS FALSE) distinct_consensuses<br>

      GROUP BY DATE(validafter)) relay_statuses_per_day<br>

    JOIN statusentry ON relay_statuses_per_day.date = DATE(validafter)<br>

    LEFT JOIN descriptor<br>

      ON statusentry.descriptor = descriptor.descriptor<br>

    WHERE running IS TRUE AND bridge IS FALSE<br>

    GROUP BY DATE(validafter), SUBSTRING(version, 1, 5),<br>

      relay_statuses_per_day.count, relay_statuses_per_day.date<br>

    ORDER BY DATE(validafter), SUBSTRING(version, 1, 5);<br>

<br>

Finally, two tables are created with the contents of these two views in<br>

order to accelerate querying, and select rights are granted to the query<br>

role:<br>

<br>

  CREATE TABLE platforms_mv as SELECT * FROM platforms_v;<br>

<br>

  CREATE TABLE versions_mv as SELECT * FROM versions_v;<br>

<br>

  GRANT SELECT ON platforms_mv TO query;<br>

<br>

  GRANT SELECT ON versions_mv TO query;<br>

<br>

<br>

Bridge churn<br>

------------<br>

<br>

Bridge churn is visualized as time plot with four lines showing how many<br>

of the running bridges at a given time are still running 1<br>

day/week/month later. The fewer bridges are missing after those times,<br>

the more stable the bridge population is.<br>

<br>

The necessary query table for this report is constructed in multiple<br>

steps. To begin with, the set of bridges in network statuses is reduced<br>

to the first network status of each day:<br>

<br>

  CREATE VIEW bridges_day_v AS<br>

    SELECT DATE(validafter), hashed_fingerprint<br>

    FROM statusentry<br>

    WHERE validafter IN (<br>

      SELECT MIN(validafter)<br>

      FROM statusentry<br>

      WHERE bridge IS true AND running IS true<br>

      GROUP BY DATE(validafter))<br>

    AND bridge IS true AND running IS true;<br>

<br>

The resulting view is materialized, as it is used by multiple other views:<br>

<br>

  CREATE TABLE bridges_day_mv AS SELECT * FROM bridges_day_v;<br>

<br>

In the next step, four views are created that determine how many bridges<br>

overlap in the sets of a given date and a date 1 day/week/month in the<br>

future from then:<br>

<br>

  CREATE VIEW bridges_churn_now_v AS<br>

    SELECT date, COUNT(*)<br>

    FROM bridges_day_mv<br>

    GROUP BY date;<br>

<br>

  CREATE VIEW bridges_churn_day_v AS<br>

    SELECT now.date, COUNT(*)<br>

    FROM bridges_day_mv now, bridges_day_mv future<br>

    WHERE now.date = DATE(future.date + interval &#39;1 day&#39;)<br>

      AND now.hashed_fingerprint = future.hashed_fingerprint<br>

    GROUP BY now.date, future.date;<br>

<br>

  CREATE VIEW bridges_churn_week_v AS<br>

    SELECT now.date, COUNT(*)<br>

    FROM bridges_day_mv now, bridges_day_mv future<br>

    WHERE now.date = date(future.date + interval &#39;1 week&#39;)<br>

      AND now.hashed_fingerprint = future.hashed_fingerprint<br>

    GROUP BY now.date, future.date;<br>

<br>

  CREATE VIEW bridges_churn_month_v AS<br>

    SELECT now.date, COUNT(*)<br>

    FROM bridges_day_mv now, bridges_day_mv future<br>

    WHERE now.date = date(future.date + interval &#39;1 month&#39;)<br>

      AND now.hashed_fingerprint = future.hashed_fingerprint<br>

    GROUP BY now.date, future.date;<br>

<br>

In the last step, these four views are joined into one common view:<br>

<br>

  CREATE VIEW bridges_churn_v AS<br>

    SELECT bridges_churn_now_v.date, bridges_churn_now_v.count AS now,<br>

      bridges_churn_day_v.count AS day,<br>

      bridges_churn_week_v.count AS week,<br>

      bridges_churn_month_v.count AS month<br>

    FROM bridges_churn_now_v<br>

      LEFT JOIN bridges_churn_day_v<br>

        ON bridges_churn_now_v.date = bridges_churn_day_v.date<br>

      LEFT JOIN bridges_churn_week_v<br>

        ON bridges_churn_day_v.date = bridges_churn_week_v.date<br>

      LEFT JOIN bridges_churn_month_v<br>

        ON bridges_churn_week_v.date = bridges_churn_month_v.date<br>

      ORDER BY bridges_churn_now_v.date;<br>

<br>

In order to improve query performance, a materialized view is created<br>

based on the last view and is given select rights to the query role:<br>

<br>

  CREATE TABLE bridges_churn_mv AS SELECT * FROM bridges_churn_v;<br>

<br>

  GRANT SELECT ON bridges_churn_mv TO query;<br>

<br>

<br>

Bridge churn histogram<br>

----------------------<br>

<br>

Bridge churn can be visualized as histogram with bars representing the<br>

number of bridges that have been seen as running for a certain number of<br>

days throughout 1 month.  The higher the bars for high numbers of days,<br>

the more stable the bridge population is.<br>

<br>

The underlying data set for this report is built from a sequence of<br>

views, starting with a view on the bridges seen on the first days of a<br>

month:<br>

<br>

  CREATE VIEW bridges_hist_first_day_v AS<br>

    SELECT DATE_TRUNC(&#39;month&#39;, validafter) AS first_day,<br>

      hashed_fingerprint<br>

    FROM statusentry<br>

    WHERE DATE(validafter) IN (<br>

      SELECT DATE(MIN(validafter))<br>

      FROM statusentry<br>

      WHERE bridge IS true AND running IS true<br>

      GROUP BY DATE_TRUNC(&#39;month&#39;, validafter))<br>

    AND bridge IS true AND running IS true<br>

    GROUP BY DATE_TRUNC(&#39;month&#39;, validafter), hashed_fingerprint;<br>

<br>

In the next step, all days are selected on which these bridges have been<br>

seen as running:<br>

<br>

  CREATE VIEW bridges_hist_seen_v AS<br>

    SELECT DATE(statusentry.validafter), statusentry.hashed_fingerprint<br>

    FROM statusentry, bridges_hist_first_day_v<br>

    WHERE DATE_TRUNC(&#39;month&#39;, statusentry.validafter) =<br>

        bridges_hist_first_day_v.first_day<br>

      AND statusentry.hashed_fingerprint =<br>

        bridges_hist_first_day_v.hashed_fingerprint<br>

    GROUP BY DATE(statusentry.validafter), statusentry.hashed_fingerprint;<br>

<br>

For each distinct bridge in each month, the number of days that this<br>

bridge has been seen as running are counted:<br>

<br>

  CREATE VIEW bridges_hist_pre_v AS<br>

    SELECT COUNT(*) AS days, DATE_TRUNC(&#39;month&#39;, date) AS month<br>

    FROM bridges_hist_seen_v<br>

    GROUP BY DATE_TRUNC(&#39;month&#39;, date), hashed_fingerprint;<br>

<br>

Finally, the number of bridges for a given number of days are added up:<br>

<br>

  CREATE VIEW bridges_hist_v AS<br>

    SELECT COUNT(*), days, DATE(month) AS month<br>

    FROM bridges_hist_pre_v<br>

    GROUP BY month, days<br>

    ORDER BY month, days;<br>

<br>

A materialized view is created and given the select rights for the query<br>

role:<br>

<br>

  CREATE TABLE bridges_hist_mv AS SELECT * FROM bridges_hist_v;<br>

<br>

  GRANT SELECT ON bridges_hist_mv TO query;<br>

<br>

</blockquote></div><br></div>