TY - GEN
T1 - ConceptDoppler
T2 - 14th ACM Conference on Computer and Communications Security, CCS'07
AU - Crandall, Jedidiah R.
AU - Zinn, Daniel
AU - Byrd, Michael
AU - Barr, Earl
AU - East, Rich
PY - 2007
Y1 - 2007
N2 - The text of this paper has passed across many Internet routers on its way to the reader, but some routers will not pass it along unfettered because of censored words it contains. We present two sets of results: 1) Internet measurements of keyword filtering by the Great "Firewall" of China (GFC); and 2) initial results of using latent semantic analysis as an efficient way to reproduce a blacklist of censored words via probing. Our Internet measurements suggest that the GFC's keyword filtering is more a panopticon than a firewall, i.e., it need not block every illicit word, but only enough to promote self-censorship. China's largest ISP, ChinaNET, performed 83.3% of all filtering of our probes, and 99.1% of all filtering that occurred at the first hop past the Chinese border. Filtering occurred beyond the third hop for 11.8% of our probes, and there were sometimes as many as 13 hops past the border to a filtering router. Approximately 28.3% of the Chinese hosts we sent probes to were reachable along paths that were not filtered at all. While more tests are needed to provide a definitive picture of the GFC's implementation, our results disprove the notion that GFC keyword filtering is a firewall strictly at the border of China's Internet. While evading a firewall a single time defeats its purpose, it would be necessary to evade a panopticon almost every time. Thus, in lieu of evasion, we propose ConceptDoppler, an architecture for maintaining a censorship "weather report" about what keywords are filtered over time. Probing with potentially filtered keywords is arduous due to the GFC's complexity and can be invasive if not done efficiently. Just as an understanding of the mixing of gases preceded effective weather reporting, understanding of the relationship between keywords and concepts is essential for tracking Internet censorship. We show that LSA can effectively pare down a corpus of text and cluster filtered keywords for efficient probing, present 122 keywords we discovered by probing, and underscore the need for tracking and studying censorship blacklists by discovering some surprising blacklisted keywords such as the Chinese terms for "conversion rate", "Mein Kampf", and "International geological scientific federation (Beijing)".
AB - The text of this paper has passed across many Internet routers on its way to the reader, but some routers will not pass it along unfettered because of censored words it contains. We present two sets of results: 1) Internet measurements of keyword filtering by the Great "Firewall" of China (GFC); and 2) initial results of using latent semantic analysis as an efficient way to reproduce a blacklist of censored words via probing. Our Internet measurements suggest that the GFC's keyword filtering is more a panopticon than a firewall, i.e., it need not block every illicit word, but only enough to promote self-censorship. China's largest ISP, ChinaNET, performed 83.3% of all filtering of our probes, and 99.1% of all filtering that occurred at the first hop past the Chinese border. Filtering occurred beyond the third hop for 11.8% of our probes, and there were sometimes as many as 13 hops past the border to a filtering router. Approximately 28.3% of the Chinese hosts we sent probes to were reachable along paths that were not filtered at all. While more tests are needed to provide a definitive picture of the GFC's implementation, our results disprove the notion that GFC keyword filtering is a firewall strictly at the border of China's Internet. While evading a firewall a single time defeats its purpose, it would be necessary to evade a panopticon almost every time. Thus, in lieu of evasion, we propose ConceptDoppler, an architecture for maintaining a censorship "weather report" about what keywords are filtered over time. Probing with potentially filtered keywords is arduous due to the GFC's complexity and can be invasive if not done efficiently. Just as an understanding of the mixing of gases preceded effective weather reporting, understanding of the relationship between keywords and concepts is essential for tracking Internet censorship. We show that LSA can effectively pare down a corpus of text and cluster filtered keywords for efficient probing, present 122 keywords we discovered by probing, and underscore the need for tracking and studying censorship blacklists by discovering some surprising blacklisted keywords such as the Chinese terms for "conversion rate", "Mein Kampf", and "International geological scientific federation (Beijing)".
KW - Blacklist
KW - ConceptDoppler
KW - Firewall ruleset discovery
KW - Great Firewall of China
KW - Internet censorship
KW - Internet measurement
KW - Keyword filtering
KW - LSA
KW - Latent semantic analysis
KW - Latent semantic indexing
KW - Panopticon
UR - http://www.scopus.com/inward/record.url?scp=74049132005&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=74049132005&partnerID=8YFLogxK
U2 - 10.1145/1315245.1315290
DO - 10.1145/1315245.1315290
M3 - Conference contribution
AN - SCOPUS:74049132005
SN - 9781595937032
T3 - Proceedings of the ACM Conference on Computer and Communications Security
SP - 352
EP - 365
BT - CCS'07 - Proceedings of the 14th ACM Conference on Computer and Communications Security
Y2 - 29 October 2007 through 2 November 2007
ER -