<div dir="ltr"><span><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">Hey Philipp!</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">Thanks for the interest! I'm one of the authors on the paper. My response is inline.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">On Wednesday, August 19, 2015, Philipp Winter <<a href="mailto:phw@nymity.ch" target="_blank">phw@nymity.ch</a>> wrote:</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">></span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> <<a href="https://kpdyer.com/publications/ccs2015-measurement.pdf" target="_blank">https://kpdyer.com/publications/ccs2015-measurement.pdf</a>></span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">></span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> They claim that they are able to detect obfs3, obfs4, FTE, and meek</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> using entropy analysis and machine learning.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">></span></p><p dir="ltr" 
style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> I wonder if their dataset allows for such a conclusion.  They use a</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> (admittedly, large) set of flow traces gathered at a college campus.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> One of the traces is from 2010.  The Internet was a different place back</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> then.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">Correct, we used datasets collected in 2010, 2012, and 2014, which together total more than 1TB of data and 14M TCP flows.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap"> </span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">We could have, say, just used the 2014 dataset. 
However, we wanted to show that the choice of dataset matters and that, even with millions of traces, the collection date and network-sensor location can impact results.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> I would also expect college traces to be very different from</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> country-level traces.  For example, the latter should contain</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> significantly more file sharing, and other traffic that is considered</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> inappropriate in a college setting.  Many countries also have popular</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> web sites and applications that might be completely missing in their</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> data sets.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">That's probably accurate. I bet that even across different types of universities (e.g., technical vs. non-technical) one might see very different patterns. Certainly different countries (e.g., Iran vs. 
China) will see different patterns, too.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">For that reason, we're going to release our code [1] prior to CCS. Liang Wang, a grad student at the University of Wisconsin-Madison, led a substantial engineering effort to make this possible. We undersold it in the paper, but the code makes it easy to rerun all these experiments on new datasets. We'd *love* it if others could rerun the experiments against new datasets and report their results.</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap"> </span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> Considering the rate difference between normal and obfuscated traffic,</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> the false positive rate in the analysis is significant.  Trained</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> classifiers also seem to do badly when classifying traces they weren't</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> trained for.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">We definitely encountered this. 
When we trained on one dataset and tested on a different one, accuracy plummeted.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">I think that raises a really interesting research question: what does it mean for two datasets to be different? For this type of classification problem, at what granularity and frequency would a network operator need to train to achieve high accuracy and low false positives? (e.g., do you need a classifier per country? state? city? neighborhood?) Also, how often does one need to retrain? daily? weekly?</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">I guess all we showed is that datasets collected from sensors at different network locations (and years apart) are different enough to impact classifier accuracy. Probably not surprising...</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap"> </span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> The authors suggest active probing to reduce false</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> positives, but don't mention that this doesn't work against obfs4 and</span></p><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">> meek.</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">I don't want to get too off track here, but do obfs4 and meek really 
resist active probing from motivated countries? Don't we still have the unsolved bridge/key distribution problem?</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">Finally, we’ll be working on a full version of this paper with additional results. If anyone is interested in reviewing it and providing feedback, we’d love to hear from you. (Philipp - do you mind if I reach out to you directly?)</span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">-Kevin </span></p><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;vertical-align:baseline;white-space:pre-wrap">[1] <a href="https://github.com/liangw89/obfs-detection" target="_blank">https://github.com/liangw89/obfs-detection</a></span></p></span></div>