<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
><channel><title>R2S</title> <atom:link href="http://www.road2stat.com/cn/feed" rel="self" type="application/rss+xml" /><link>http://www.road2stat.com/cn</link> <description>江湖一散人</description> <lastBuildDate>Thu, 26 Jan 2012 08:18:36 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>2011</title><link>http://www.road2stat.com/cn/life/2011.html</link> <comments>http://www.road2stat.com/cn/life/2011.html#comments</comments> <pubDate>Sat, 31 Dec 2011 13:47:55 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[生活点滴]]></category> <category><![CDATA[2011]]></category> <category><![CDATA[2012]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=1011</guid> <description><![CDATA[2011的记忆从未消失过, 正如2011的承诺没有改变过明天. 希望在2012中, 多干活少吐槽, 本着什么都不靠只靠谱的原则, 继续靠谱下去.]]></description> <content:encoded><![CDATA[<p>2011的记忆从未消失过, 正如2011的承诺没有改变过明天.</p><p>希望在2012中, 多干活少吐槽, 本着什么都不靠只靠谱的原则, 继续靠谱下去.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/life/2011.html/feed</wfw:commentRss> <slash:comments>3</slash:comments> </item> <item><title>豆瓣评分计算策略的猜想</title><link>http://www.road2stat.com/cn/statistics/douban_rank.html</link> <comments>http://www.road2stat.com/cn/statistics/douban_rank.html#comments</comments> <pubDate>Sat, 31 Dec 2011 12:48:49 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[统计之路]]></category> <category><![CDATA[douban]]></category> <category><![CDATA[IMDB]]></category> <category><![CDATA[quantreg]]></category> <category><![CDATA[XML]]></category> <category><![CDATA[公式]]></category> <category><![CDATA[分位回归]]></category> <category><![CDATA[参数]]></category> <category><![CDATA[排序]]></category> <category><![CDATA[测度]]></category> <category><![CDATA[计算]]></category> <category><![CDATA[评分]]></category> <category><![CDATA[豆瓣]]></category> 
<category><![CDATA[豆瓣电影250]]></category> <category><![CDATA[距离]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=985</guid> <description><![CDATA[1 引 在九月短文 [1] 中, 我们对豆瓣电影评分的一个侧面有了简单认识. 其实, 我们对评分计算规则本身也是很感兴趣的. 这里以豆瓣电影为例作一简单猜想和分析, 音乐图书同理. 题中"策略"是相对"机制"来说的, 所指其实是比较具体的. 2 单个条目 有群众表示, 单个条目的评分计算只是对各个星级打分人数简单的加权平均, 由于页面上显示的评分结果满分是10分, 而打分时只有5个星级, 所以每个星级对应2分, 单个条目评分的计算公式即为: 评分 = (10 x 5星比例) + (8 x 4星比例) + ... + (2 x 1星比例) 抽取部分条目对此假设进行手工验证, 可以发现的确如此. 但是, 这里存在的一个陷阱是, 由于评分数据的特殊性和抽样的限制, &#8230; <a href="http://www.road2stat.com/cn/statistics/douban_rank.html">继续阅读 <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/simpsons_movie.jpg"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/simpsons_movie.jpg" alt="simpsons_movie" title="simpsons_movie" width="500" height="325" class="aligncenter size-full wp-image-986" /></a></p><h2>1 引</h2><p>在九月短文 [1] 中, 我们对豆瓣电影评分的一个侧面有了简单认识. 其实, 我们对评分计算规则本身也是很感兴趣的. 这里以豆瓣电影为例作一简单猜想和分析, 音乐图书同理. 题中"策略"是相对"机制"来说的, 所指其实是比较具体的.</p><h2>2 单个条目</h2><p>有群众表示, 单个条目的评分计算只是对各个星级打分人数简单的加权平均, 由于页面上显示的评分结果满分是10分, 而打分时只有5个星级, 所以每个星级对应2分, 单个条目评分的计算公式即为:</p><p><code>评分 = (10 x 5星比例) + (8 x 4星比例) + ... + (2 x 1星比例)</code></p><p>抽取部分条目对此假设进行手工验证, 可以发现的确如此.</p><p>但是, 这里存在的一个陷阱是, 由于评分数据的特殊性和抽样的限制, 如果我们抽取一部分数据做回归, 结果可能会受到样本的影响而与手工验证的结果产生偏移. 由于1星(很差)和2星(较差)在大量条目样本中所占往往比例非常小, 普通的回归非常容易倾向于使X1, X2, X3的系数减小. 举例来说, 从<a href="http://movie.douban.com/people/road2stat/collect?sort=time&#038;mode=list" target="_blank">我看过</a>的443部电影中抽取前400个条目作为样本 <a href='http://www.road2stat.com/cn/wp-content/attachments/2011/12/rateSample.csv'>[rateSample.csv]</a> 作回归.</p><p><span id="more-985"></span></p><p>回归结果:<br /> <code>Call:<br /> lm(formula = Y ~ . - 1, data = rateSample)</p><p>Residuals:<br /> Min        1Q    Median        3Q       Max<br /> -0.248949 -0.036126  0.001084  0.042101  0.217110</p><p>Coefficients:<br /> Estimate Std. 
Error t value  Pr(>|t|)<br /> X5  9.97216    0.01884  529.436 <2e-16 ***<br /> X4  7.98297    0.03287  242.901 <2e-16 ***<br /> X3  5.64024    0.05587  100.962 <2e-16 ***<br /> X2  4.41272    0.29841   14.787 <2e-16 ***<br /> X1  1.38666    0.64675    2.144  0.0326 *<br /> ---<br /> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</p><p>Residual standard error: 0.06798 on 395 degrees of freedom<br /> Multiple R-squared: 0.9999,	Adjusted R-squared: 0.9999<br /> F-statistic: 1.053e+06 on 5 and 395 DF,  p-value: < 2.2e-16</code></p><p>由于这里的评分人数比例存在四舍五入现象, 所以每个条目5个星级的评分人数比例之和并不一定严格为1, 不过画图可知基本都处于[0.999, 1.001], 存在4个和为0.998, 1.002的样本, 不影响结果.</p><p>观察回归结果发现, 样本的这种特殊情况确对X3, X2, X1项有影响, 虽然检验结果是显著的, 但偏离了真实值6, 4, 2很远.</p><p>插句题外话, Box(是的, 就是你知道的那个Box)曾曰, <em>Statisticians, like artists, have the bad habit of falling in love with their models</em>.</p><p>爱不爱上模特的事情我不是很懂, 不过, 真的不要爱上模型. 原因么, 我们续写一下名句就知道了:</p><p><em>Models are always with their assumptions</em>.</p><p>题外话完毕. 使用平行坐标图展示一下这400部的评分数据:</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/rate_para_coord.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/rate_para_coord.png" alt="rate_para_coord" title="rate_para_coord" width="500" height="309" class="aligncenter size-full wp-image-988" /></a></p><p>可见大部分样本的整体评分状况. 1, 2星的数量远不及3, 4, 5星.</p><p>猜想这种情况可能是两种原因所致:</p><ul><li>豆瓣的电影推荐让用户更倾向于去看他人评价较高的影片, 越是评分较低的影片越无人光顾, 于是拉高了整体评分;</li><li>用户可能对那些质量一般的电影疏于打分, 而对自己喜欢的片子倾向于打分. 这就造成了客观上部分低分评分数据的缺失, 特别是在5分制(不同于IMDB的10分制)对于影片的区分度比较低的情况下.</li></ul><h2>3 豆瓣电影250榜单</h2><p>豆瓣电影和IMDB都有TOP 250榜单. 关于豆瓣电影250的计算方法, 之前已经有一些讨论 [2]. 一个有趣的问题是, 假设豆瓣的确使用了IMDB公式 [3] 计算得到此榜单, 可否由数据反演出公式中的两个参数?</p><p>其实, 这是一个以排序为因变量的回归问题, 既不同于传统的纯回归问题, 又不同于经典的排序学习问题. 事实上, 这个问题对两方面都提出了比较高的要求:</p><ol><li>对于传统的回归问题, 这里我们虽然想求得回归系数, 但目标变量是一种排序;</li><li>对于经典的排序学习问题, 这里我们虽然目标变量是排序, 但要求针对回归方程求出显式的回归系数;</li><li>如果转化为传统的分类问题, 信息会有比较大的损失.</li></ol><p>翻箱倒柜, 发现KDD10'的一片文章中, 来自Google的D. 
Sculley提出了一种方法 [4] (给出了现成工具 [5], 还有人port了R包 [6]) 来处理这类问题, 但由于这里的问题是非线性的, 不好直接处理.</p><p>不过不要对生活失去信心. 由于维度较低, 最终仍然有一种方法是可行的: 我们可以估计出参数所在的大致区间, 然后针对某种排序准确性的测度(这里单纯地采用了街区距离和欧氏距离), 暴力搜索这些区间组成的空间, 最后取排序结果最准确的点或点集. 按这个思路做了一下, 在<strong>使用这个公式的假设下</strong>, 可以估计取得数据时 (2011/12/26) 的参数C约为[6.0, 6.1], 参数m大致在[2900, 3100].</p><p>由于豆瓣电影250的榜单并非实时更新, 而榜单中的评分人数和得分却是实时更新的, 且网格的密度有限, 猜解结果理应存在误差. 我们猜测, 这种现实情况可能对于位于榜单后半部分的影片产生更强的影响, 而榜单上排名靠前的影片则会相对稳定. 对各个元素所在位置与其真实位次产生的偏移做分位回归 [7]:</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/rank_shift_quantreg.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/rank_shift_quantreg.png" alt="rank_shift_quantreg" title="rank_shift_quantreg" width="500" height="470" class="aligncenter size-full wp-image-989" /></a></p><p>红色虚线为最小二乘估计, 灰色实线为0.1, 0.25, 0.75, 0.9五个分位点处的分位回归估计, 黑色实线为0.5分位点.</p><p>看图说话:</p><ul><li>榜单偏后部分的偏移比前半部分稍强;</li><li>不同分位点的样本点偏移程度受其所在榜单位置的影响稍有不同.</li></ul><p>有兴趣的同学还可以跟踪观察榜单更新时得到的计算结果将有何变化.</p><h2>4 结</h2><p>有益的思考:</p><ol><li>单个条目的评分就是单纯的对5个得分进行加权平均. 此法尚有修正空间, 如果未考虑不同用户的评分权重, 则容易引入恶意评分问题;</li><li>豆瓣250榜单的计算可能借鉴了IMDB公式, 也可能对其设计进行了修改. 关于这个公式本身, 存在一些评论 [8], 或可对其进行修正;</li><li>在<strong>使用这个公式的假设下</strong>, 可以根据豆瓣电影250榜单的变化情况即时猜解参数, 从而了解当时(设定的)所有电影的平均分和上榜最低评分人数标准. 由这两个参数, 结合现有榜单所含信息, 我们可以在榜单更新延迟时, 提前推得某个条目在有一定评价人数(>m)时, 达到某个位置所需的最低得分; 或保持一定得分前提下, 分析上榜所需的最少评分人数.</li></ol><p>存在的问题:</p><ol><li>如果实际上未使用原始公式, 则以上估计是几乎没有什么意义的;</li><li>如果榜单有人为因素的干预, 例如只计算经常打分的用户的打分, 将对这种估计造成影响 [8];</li><li>排序准确性测度有待商榷;</li><li>这种解法虽然给出了全局最小, 但这个全局最小并不一定与真实参数等价, 真实参数也有可能隐匿在其它较小值的集合中.</li></ol><p>最后, 有代码有真相. <a href='http://www.road2stat.com/cn/wp-content/attachments/2011/12/dbrank.R'>[dbrank.R]</a></p><h2>参考</h2><p>[1] R2S. <a href="http://www.road2stat.com/cn/statistics/douban_rating.html" title="豆瓣用户对不同类型影片的打分是否真的有倾向性?" target="_blank">豆瓣用户对不同类型影片的打分是否真的有倾向性?</a></p><p>[2] 麻油四. <a href="http://www.douban.com/group/topic/2426734/" target="_blank">豆瓣250算法浅析</a>.</p><p>[3] Wikipedia. 
<a href="http://en.wikipedia.org/wiki/Internet_Movie_Database" target="_blank">Internet Movie Database</a>.</p><p>[4] D. Sculley. Combined Regression and Ranking. Proceedings of the 16th Annual SIGKDD Conference on Knowledge Discover and Data Mining, 2010.</p><p>[5] D. Sculley. <a href="http://code.google.com/p/sofia-ml/" target="_blank">sofia-ml</a> - Suite of Fast Incremental Algorithms for Machine Learning.</p><p>[6] Michael King and Fernando Cela Diaz. (2011). <a href="http://CRAN.R-project.org/package=RSofia" target="_blank">RSofia</a>: Port of sofia-ml to R.</p><p>[7] Roger Koenker (2011). <a href="http://CRAN.R-project.org/package=quantreg" target="_blank">quantreg</a>: Quantile Regression.</p><p>[8] <a href="http://www.azillionmonkeys.com/qed/imdbfix.shtml" target="_blank">Corrected IMDb Movie Rankings</a>.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/statistics/douban_rank.html/feed</wfw:commentRss> <slash:comments>4</slash:comments> </item> <item><title>冬青黑体 vs 华文细黑：叠加对比</title><link>http://www.road2stat.com/cn/imaging/hiragino_vs_xihei.html</link> <comments>http://www.road2stat.com/cn/imaging/hiragino_vs_xihei.html#comments</comments> <pubDate>Thu, 22 Dec 2011 13:05:43 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[光影之魅]]></category> <category><![CDATA[Hiragino Sans GB]]></category> <category><![CDATA[STXihei]]></category> <category><![CDATA[冬青黑体]]></category> <category><![CDATA[华文细黑]]></category> <category><![CDATA[字体]]></category> <category><![CDATA[对比]]></category> <category><![CDATA[苹果]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=973</guid> <description><![CDATA[北国冰城哈尔滨今年冬季是出奇的暖和, 再次提醒了我们距离2012的到来只剩下一整年, 仍然没有买到船票的同学们要抓紧时间了. 今天让我们叠加比较一下苹果的新旧主力中文字体: 冬青黑体(Hiragino Sans GB W3)和华文细黑(STXihei). 冬青黑体 = 红, 华文细黑 = 蓝. 简要总结: 同等字号下, 冬青黑体字面的确较华文细黑大, 可能有利于屏幕显示; 对笔锋的处理, 没有华文细黑那么夸张, 朴素多了; 冬青黑体在斜弯钩的收笔明显长于华文细黑, 同时压缩了右下角元素的比例, 整体张弛有度, 着墨更加均匀. References [1] Type is Beautiful. 
雪豹新简体字体 Hiragino Sans GB. [2] 林泉约. 混乱的国标，不统一的“走”. [3] Wikipedia. Hiragino. &#8230; <a href="http://www.road2stat.com/cn/imaging/hiragino_vs_xihei.html">继续阅读 <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p>北国冰城哈尔滨今年冬季是出奇的暖和, 再次提醒了我们距离2012的到来只剩下一整年, 仍然没有买到船票的同学们要抓紧时间了. 今天让我们叠加比较一下苹果的新旧主力中文字体: 冬青黑体(Hiragino Sans GB W3)和华文细黑(STXihei).</p><p>冬青黑体 = 红, 华文细黑 = 蓝.</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/vs_chs.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/vs_chs.png" alt="" title="hiragino_vs_xihei_chs" width="500" height="810" class="aligncenter size-full wp-image-974" /></a></p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/vs_cht.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/vs_cht.png" alt="" title="hiragino_vs_xihei_cht" width="500" height="810" class="aligncenter size-full wp-image-975" /></a></p><p>简要总结:</p><ol><li>同等字号下, 冬青黑体字面的确较华文细黑大, 可能有利于屏幕显示;</li><li>对笔锋的处理, 没有华文细黑那么夸张, 朴素多了;</li><li>冬青黑体在斜弯钩的收笔明显长于华文细黑, 同时压缩了右下角元素的比例, 整体张弛有度, 着墨更加均匀.</li></ol><h2>References</h2><p>[1] Type is Beautiful. <a href="http://www.typeisbeautiful.com/2010/01/1894" target="_blank">雪豹新简体字体 Hiragino Sans GB</a>.</p><p>[2] 林泉约. <a href="http://lethean.me/archives/299" target="_blank">混乱的国标，不统一的“走”</a>.</p><p>[3] Wikipedia. <a href="http://zh.wikipedia.org/wiki/Hiragino" target="_blank">Hiragino</a>.</p><p>[4] Lukhnos D. Liu. <a href="http://blog.lukhnos.org/post/195916082/hiragino-sans-gb-a-typeface-with-japanese-soul-and" target="_blank">Hiragino Sans GB: A typeface with Japanese soul and Simplified Chinese look</a>.</p><p>[5] 齐立. <a href="http://www.foundertype.com/index/stylist/ql.html" target="_blank">微软雅黑的设计</a>.</p><p>[6] 李少波. 黑体字研究: [博士学位论文]. 
北京: 中央美术学院, 2008.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/imaging/hiragino_vs_xihei.html/feed</wfw:commentRss> <slash:comments>6</slash:comments> </item> <item><title>OpenScholar是个好项目</title><link>http://www.road2stat.com/cn/life/openscholar.html</link> <comments>http://www.road2stat.com/cn/life/openscholar.html#comments</comments> <pubDate>Sat, 10 Dec 2011 17:58:31 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[生活点滴]]></category> <category><![CDATA[CMS]]></category> <category><![CDATA[Drupal]]></category> <category><![CDATA[IQSS]]></category> <category><![CDATA[OpenScholar]]></category> <category><![CDATA[内容管理系统]]></category> <category><![CDATA[学术]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=966</guid> <description><![CDATA[度过了一段史诗般的酒池肉林，华丽丽的两个月木有更新，直到我膝盖中了一箭。 两天前发现了OpenScholar这个项目，是几个IQSS的家伙鼓捣出来的，旨在为院系所实验室这样的研究机构提供一个快速构建大量个人和群体站点的平台，基于Drupal开发，自带了一些biblio这类模块，Google一下会发现还是有一些学校用户的。缺点是全局配置比较痛苦和繁琐，只用来建一个站有点奢侈了。不过非常喜欢它的自带主题，于是果断砍掉原来丑到不能看的静态主页，把长期不更新的页面稍微理顺了一下，太息曰：“内容管理系统，是所有建站者一生都无法逃脱的劫数。” 即使是小学生作文，也是要尽快写完的，十月的时候扔了两个草稿在那，已然忘光了。]]></description> <content:encoded><![CDATA[<p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/openscholar.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/openscholar.png" alt="openscholar" title="openscholar" width="450" height="140" class="aligncenter size-full wp-image-968" /></a></p><p>度过了一段史诗般的酒池肉林，华丽丽的两个月木有更新，直到我膝盖中了一箭。</p><p>两天前发现了<a href="http://openscholar.harvard.edu/" target="_blank">OpenScholar</a>这个项目，是几个<a href="http://www.iq.harvard.edu/" target="_blank">IQSS</a>的家伙鼓捣出来的，旨在为院系所实验室这样的研究机构提供一个快速构建大量个人和群体站点的平台，基于Drupal开发，自带了一些biblio这类模块，Google一下会发现还是有一些学校用户的。缺点是全局配置比较痛苦和繁琐，只用来建一个站有点奢侈了。不过非常喜欢它的自带主题，于是果断砍掉原来丑到不能看的静态<a href="http://www.road2stat.com/" target="_blank">主页</a>，把长期不更新的页面稍微理顺了一下，太息曰：“内容管理系统，是所有建站者一生都无法逃脱的劫数。”</p><p>即使是小学生作文，也是要尽快写完的，十月的时候扔了两个草稿在那，已然忘光了。</p> ]]></content:encoded> 
<wfw:commentRss>http://www.road2stat.com/cn/life/openscholar.html/feed</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>Ten Typical Symptoms of Potential Academic Paranoia</title><link>http://www.road2stat.com/cn/statistics/academic_paranoia.html</link> <comments>http://www.road2stat.com/cn/statistics/academic_paranoia.html#comments</comments> <pubDate>Tue, 11 Oct 2011 16:42:26 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[统计之路]]></category> <category><![CDATA[academic]]></category> <category><![CDATA[Lisa Simpson]]></category> <category><![CDATA[paranoia]]></category> <category><![CDATA[symptoms]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=956</guid> <description><![CDATA[Getting used to writing articles that begin with a section named 'Introduction' or end up with section 'Conclusions'. Always cites several references in any type/length of essays; strongly believes that without the citations, the work will not be recognized by &#8230; <a href="http://www.road2stat.com/cn/statistics/academic_paranoia.html">继续阅读 <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/10/Frink.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/10/Frink.png" alt="Prof.Frink" title="Prof.Frink" width="480" height="445" class="aligncenter size-full wp-image-958" /></a></p><ol><li>Getting used to writing articles that begin with a section named 'Introduction' or end up with section 'Conclusions'.</li><li>Always cites several references in any type/length of essays; strongly believes that without the citations, the work will not be recognized by anybody.</li><li>Hates magazines with huge pictures and imprecise textual materials; has a special fondness for two-column, small font, tight dissertations with formulas, three-line tables, and stylish,  dot-and-line formed scalable graphics.</li><li>Uses a 
reference manager, instead of regular tools such as Google Calendar, to organize daily life.</li><li>Has blogged about academic topics constantly for 2.5+ years, or has set up a stand-alone blog about current research.</li><li>Has talked about academic matters in 50%+ of Twitter/Facebook status updates over the last 2 years, or keeps social accounts for purely academic purposes.</li><li>Has had at least one horrible nightmare about a B+ ruining perfect straight As, just like Lisa Simpson did.</li><li>On encountering some data from the middle of nowhere, always considers what its underlying patterns look like; imagines constructing a quantitative model for it, very seriously.</li><li>On seeing a problem, cannot help diving into scholarly databases to retrieve related papers, thoroughly reading the references and digging recursively; gigabytes of papers eventually end up stored on the hard drive.</li><li>Blogs about academic paranoia and doesn't feel anything, until now.</li></ol> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/statistics/academic_paranoia.html/feed</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>Connecting R to PostgreSQL</title><link>http://www.road2stat.com/cn/r_language/rpostgresql.html</link> <comments>http://www.road2stat.com/cn/r_language/rpostgresql.html#comments</comments> <pubDate>Mon, 03 Oct 2011 19:30:46 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[R]]></category> <category><![CDATA[DBI]]></category> <category><![CDATA[PostgreSQL]]></category> <category><![CDATA[数据库]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=949</guid> <description><![CDATA[Lately I have been playing Mirror's Edge, DICE's masterpiece from three years ago, and revisiting Mafia II from a year ago, which has left me almost no time to post here. Between games I happened to work with a PostgreSQL database, so here are some brief notes. There are several ways to connect R to a database, basically DBI, ODBC, and JDBC. ODBC and JDBC, though, are frankly weak. The JDBC route depends on rJava, which haunts it like a ghost and is really hard to install. It is also possible that the JDK package on AUR is badly packaged and does not meet what R CMD javareconf expects. A while ago, when installing RWeka, I studied the rJava installation script closely; it kept failing at the step that compiles a simple JNI program, and hand-editing various configuration files did not help, so I gave up ... Rant over; back to the topic. PostgreSQL comes out of Berkeley and is released under a license named after itself (swagger on full display), with many fine features. 
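I always want to read the name as Post·GRE·SQL, which translates easily as "the structured query language of the old GRE"; but how are the gentlemen out in rural New Jersey, wholeheartedly reforming the GRE, supposed to feel about a name like that ... Of course, either way you end up with ten thousand or so English words barging in. Meanwhile, the RPostgreSQL package we chose is a GSoC '08 project, with R community heavyweights such as Dirk Eddelbuettel involved, and it is a genuine member of the DBI family. In addition, users from the Bioconductor project have contributed the RdbiPgSQL/pgUtils packages, also in the DBI family. Use them as you see fit. &#8230; <a href="http://www.road2stat.com/cn/r_language/rpostgresql.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p>Lately I have been playing Mirror's Edge, DICE's masterpiece from three years ago, and revisiting Mafia II from a year ago, which has left me almost no time to post here. Between games I happened to work with a PostgreSQL database, so here are some brief notes.<br /> <a href="http://www.road2stat.com/cn/wp-content/attachments/2011/10/mirrorsedge.jpg"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/10/mirrorsedge.jpg" alt="mirrorsedge" title="mirrorsedge" width="500" height="336" class="aligncenter size-full wp-image-950" /></a></p><p>There are several ways to connect R to a database, basically DBI, ODBC, and JDBC. ODBC and JDBC, though, are frankly weak. The JDBC route depends on rJava, which haunts it like a ghost and is really hard to install. It is also possible that the JDK package on AUR is badly packaged and does not meet what R CMD javareconf expects. A while ago, when installing RWeka, I studied the rJava installation script closely; it kept failing at the step that compiles a simple JNI program, and hand-editing various configuration files did not help, so I gave up ...<br /> <span id="more-949"></span><br /> Rant over; back to the topic. PostgreSQL comes out of Berkeley and is released under a license named after itself (swagger on full display), with <a href="http://obmem.info/?p=493" target="_blank">many fine features</a>. I always want to read the name as Post·GRE·SQL, which translates easily as "the structured query language of the old GRE"; but how are the gentlemen out in rural New Jersey, wholeheartedly reforming the GRE, supposed to feel about a name like that ... Of course, either way you end up with ten thousand or so English words barging in. Meanwhile, the RPostgreSQL package we chose is a GSoC '08 project, with R community heavyweights such as Dirk Eddelbuettel involved, and it is a genuine member of the DBI family.</p><p>In addition, users from the Bioconductor project have contributed the RdbiPgSQL/pgUtils packages, also in the DBI family. Use them as you see fit.</p><pre># For Arch Linux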
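# Install the PostgreSQL server
$ sudo pacman -S postgresql
# Start the daemon
$ sudo /etc/rc.d/postgresql start
# Create a superuser role
$ sudo createuser -s -U postgres
# psql is a handy tool; -l lists the existing databases
$ psql -l
# Create a database
$ createdb newdatabase
# Import a pg_dump file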
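$ psql -d newdatabase -U postgres -f dump.sql</pre><p>The import speed is decent: a pg_dump file of over 700 MB loads in just a few minutes, and disk usage grows to more than 2 GB. Usage of PostgreSQL is covered in detail on the <a href="https://wiki.archlinux.org/index.php/PostgreSQL" target="_blank">Arch Wiki</a>. For day-to-day administration, I strongly recommend pgAdmin, a free, cross-platform GUI tool that is also released under the PostgreSQL license:</p><pre>sudo pacman -S pgadmin3</pre><p>The R part is very easy:</p><pre>library(RPostgreSQL)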
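# Load the driver
drv = dbDriver("PostgreSQL")
# Fill in the connection details
con = dbConnect(drv, dbname = "database_name",
user = "user_name", password = "password", port = 5432)
# Send a query
rs = dbSendQuery(con, statement = "SQL statement")
# Fetch all rows of the result (n = -1 means no row limit)
df = fetch(rs, n = -1)
# Release the result set once it has been fetched
dbClearResult(rs)
# Alternatively, run a query and get the result back in one step
dbGetQuery(con, "SQL statement")
# Close the connection
dbDisconnect(con)
# Free the driver resources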
dbUnloadDriver(drv)</pre><p>When writing data, you may run into luck-of-the-draw problems with data types, character encodings, and the like. For more details, consult the documentation; mind you, the experts tend to be too lazy to write vignettes. Users left with nothing but the function reference manual to read really have it rough.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/r_language/rpostgresql.html/feed</wfw:commentRss> <slash:comments>6</slash:comments> </item> <item><title>Visualizing CRAN Package Dependency Network: Reveal Hidden Patterns with Martin Krzywinski&#039;s Hive Panel</title><link>http://www.road2stat.com/cn/statistics/hivepanel.html</link> <comments>http://www.road2stat.com/cn/statistics/hivepanel.html#comments</comments> <pubDate>Thu, 29 Sep 2011 12:06:10 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[统计之路]]></category> <category><![CDATA[community]]></category> <category><![CDATA[CRAN]]></category> <category><![CDATA[dependency]]></category> <category><![CDATA[hive panel]]></category> <category><![CDATA[hive plots]]></category> <category><![CDATA[network]]></category> <category><![CDATA[package]]></category> <category><![CDATA[R]]></category> <category><![CDATA[Visualization]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=937</guid> <description><![CDATA[1 Introduction Studying the networks of online software communities is fascinating. CPAN Explorer is a typical project aiming at analyzing the relationships in the CPAN community [1]. The CRAN package dependency network is another excellent source for this type of research. A &#8230; <a href="http://www.road2stat.com/cn/statistics/hivepanel.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<h1>1 Introduction</h1><p>Studying the networks of online software communities is fascinating. <a href="http://cpan-explorer.org/" target="_blank">CPAN Explorer</a> is a typical project aiming at analyzing the relationships in the CPAN community [1]. The CRAN package dependency network is another excellent source for this type of research. 
A state-of-the-art visualization is usually required to understand the network [2].</p><p>A common problem with conventional hairball-style network visualization is that the graph becomes uninterpretable for very large networks [3]. Researchers have developed techniques such as hierarchical edge bundles [4] to tackle this problem. However, such techniques are too idealized for real-world visualization problems: by emphasizing the strong connections in the network, they can hide the weaker connections and key details. Conventional visualization methods have thus kept us from taking a further step: revealing more hidden information about the internal structure (vertices, connectivity, etc.) of the network.</p><h1>2 Hive Plots</h1><p>Martin Krzywinski, author of the circular genome visualization tool Circos, proposed hive plots in 2010 [5]. The most significant difference between hive plots and traditional layouts is that their graphic design is based on meaningful properties of the network (vertex degree, connectivity, centrality, etc.) instead of aesthetics. This design makes the graph interpretable and thus simplifies the presentation of relational data.</p><h1>3 The Visualization</h1><p>We selected 27 representative packages and visualized three of them in each hive plot to make a 3x3 hive panel. Each panel represents a specific research field. Each node of the network is mapped onto the axes by its degree: the green axis represents out-degree, the orange axis represents in-degree, and the purple axis combines in-degree and out-degree. On each axis, outer nodes have higher degrees. 
The white connections, as the background, show us the overall connectivity of the network: nodes with higher out-degrees are heavily depended upon by nodes across the whole range of the network, and the brighter parts of the arcs tend to indicate potential cluster patterns.</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/09/hiveplot.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/09/hiveplot.png" alt="hiveplot" title="hiveplot" width="500" height="500" class="aligncenter size-full wp-image-938" /></a></p><blockquote><p><a href="http://www.flickr.com/photos/road2stat/6194849428/" target="_blank">Click here to see a larger version.</a></p></blockquote><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/09/hivepanel.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/09/hivepanel.png" alt="hivepanel" title="hivepanel" width="500" height="500" class="aligncenter size-full wp-image-939" /></a></p><blockquote><p><a href="http://www.flickr.com/photos/road2stat/6194849550/" target="_blank">Click here to see a larger version.</a></p></blockquote><p>Meanwhile, in each panel we highlight three packages of interest from one research field, using three different colors to reveal their specific connection patterns. In the first panel, the green connections represent the <strong>lattice</strong> package. It's a fundamental package for graphic design in R, and packages of all degrees depend heavily on it. The purple connections represent the <strong>rgl</strong> package. It depends on few packages itself but is depended upon by many more, and these are distributed more discretely along the orange axis than those of <strong>lattice</strong>. The orange connections represent the <strong>gplots</strong> package, which contains various miscellaneous tools for plotting. Its dependency pattern indicates a different role from the previous two: it's more of a handy toolset for plotting than a core package. 
The upper right panel shows us three of the data import/export packages: <strong>DBI</strong>, <strong>RODBC</strong> and <strong>RSQLite</strong>. Amazingly, although they play different roles in the whole community, their dependency patterns are almost the same, except for slight differences in their degrees. The central panel, which highlights the finance-related packages <strong>fBasics</strong>, <strong>fOptions</strong>, and <strong>fGarch</strong>, reveals similar features.</p><p>Hive plots are considerably more informative and comprehensive than conventional hairball-style visualizations, especially for large networks. You can discover many more interesting patterns in the other panels yourself with this visualization.</p><p>The selected packages (ordered by panel 11, 12, 13, 21, 22 …) are:</p><ul><li>Graphics: lattice / rgl / gplots (Green / Purple / Orange)</li><li>Programming: tools / rJava / Rcpp</li><li>Data Import/Export: DBI / RODBC / RSQLite</li><li>GUI Dev Tools &#038; Framework: tcltk / gWidgets / Rcmdr</li><li>Finance: fBasics / fOptions / fGarch</li><li>Machine Learning: e1071 / rpart / randomForest</li><li>Regression Analysis: car / leaps / quantreg</li><li>Spatial and Geo Statistics: sp / maps / fields</li><li>Time Series Analysis: forecast / timeDate / tseries</li></ul><h1>4 Details</h1><p>Creating this visualization is really simple, and highly reproducible for anyone with a little knowledge of social network analysis (SNA) [6]:</p><ol><li>The original data was retrieved from<br /> <a href="http://cran.r-project.org/bin/windows/contrib/2.13/PACKAGES" target="_blank">http://cran.r-project.org/bin/windows/contrib/2.13/PACKAGES</a><br /> on September 14, 2011. We only extracted the 'Depends' field of each package. 
After parsing and a bit of cleaning, a network consisting of <strong>2,500</strong> vertices and <strong>5,900</strong> arcs was constructed.</li><li>To shrink the network, perform k-core analysis and extract the partition formed by the 4-6 cores as a new, denser network with less noise. It is now reduced to about <strong>600</strong> vertices and <strong>2,500</strong> arcs.</li><li>Draw the reduced network, with nodes ordered by degree, using Martin's linnet tool. Each panel conveys a package's degree and the distribution of its dependencies. Combine the nine separate hive plots to form a complete hive panel.</li></ol><h1>References</h1><p>[1] Julian Bilcke. CPAN Explorer - An Interactive Exploration of the Perl Ecosystem. <a href="http://cpan-explorer.org/" target="_blank">http://cpan-explorer.org/</a>, 2009.<br /> [2] Xiao Nan. R2S - PKU Vis Summer School. <a href="http://www.road2stat.com/cn/statistics/pku_vis_summer_school.html" target="_blank">http://www.road2stat.com/cn/statistics/pku_vis_summer_school.html</a>, 2010.<br /> [3] Koon-Kiu Yan, Gang Fang, Nitin Bhardwaj, Roger P. Alexander, Mark Gerstein. Comparing Genomes to Computer Operating Systems in Terms of the Topology and Evolution of their Regulatory Control Networks. Proceedings of the National Academy of Sciences, 107 (20): 9186 - 9191, 2010.<br /> [4] Danny Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Transactions on Visualization and Computer Graphics (TVCG; Proceedings of Vis/InfoVis 2006), Vol. 12, No. 5, 741 - 748, 2006.<br /> [5] Martin Krzywinski. Hive Plots - Linear Layout for Network Visualization - Visually Interpreting Network Structure and Content Made Possible. <a href="http://www.hiveplot.com/" target="_blank">http://www.hiveplot.com/</a>, 2010.<br /> [6] Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj. Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.<br /> [7] J.R. Heard. 
World Economic Forum Hive Plot. <a href="http://www.visualizing.org/visualizations/world-economic-forum-hive-plot/" target="_blank">http://www.visualizing.org/visualizations/world-economic-forum-hive-plot/</a>, 2010.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/statistics/hivepanel.html/feed</wfw:commentRss> <slash:comments>5</slash:comments> </item> <item><title>豆瓣用户对不同类型影片的打分是否真的有倾向性?</title><link>http://www.road2stat.com/cn/statistics/douban_rating.html</link> <comments>http://www.road2stat.com/cn/statistics/douban_rating.html#comments</comments> <pubDate>Tue, 06 Sep 2011 16:27:28 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[统计之路]]></category> <category><![CDATA[Mann-Whitney检验]]></category> <category><![CDATA[stripplot]]></category> <category><![CDATA[Wilcoxon检验]]></category> <category><![CDATA[倾向性]]></category> <category><![CDATA[小提琴图]]></category> <category><![CDATA[恐怖片]]></category> <category><![CDATA[打分]]></category> <category><![CDATA[核密度估计]]></category> <category><![CDATA[箱线图]]></category> <category><![CDATA[豆瓣]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=920</guid> <description><![CDATA[1 起 在豆瓣上为数不少的恐怖/惊悚片的讨论中, 我们常常可以发现类似于这样的说法 [1]: 像这部片子 也有吸引人看下去的地方 为什么分数总是那么低? 那么, 豆瓣用户对这类影片的打分上, 是否真的存在普遍低于其他类型的影片的情况? 为了验证这个猜想, 我们不妨利用豆瓣提供的评分数据来简单分析一下. 2 承 首先明确问题的定义. 这里我们不去比较恐怖片和其他所有类型片的总体. 其实, 更让人感兴趣的问题是, 将恐怖片与其他同级别类型的影片分别进行两两比较, 结果会如何. 豆瓣的电影条目是采用tag来进行分类的. 此时样本的选取成了一个问题. 总的来说, 要保证各类型影片的类型特征区别要尽量大, 比如恐怖片和惊悚片之间的差别没有恐怖片和励志片的差别明显, 又如有可能一部影片既有"爱情"标签, 也有"喜剧"标签, 也就是说, 各类型的影片将存在交集. 同时, 也要保证各类样本在其他方面的差别尽量小, 如不同类型影片的总体规模差距不能过于悬殊等等. 我是这样做的. 2.1 选取分类 参考IMDB的Genre分类, 在豆瓣电影标签页面选取8个比较主流的Genre, 标准是各类含有个数的差别不能过大, 因此剔除了动作/爱情/科幻等远超过100万部的类别. 
&#8230; <a href="http://www.road2stat.com/cn/statistics/douban_rating.html">继续阅读 <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<h1>1 起</h1><p>在豆瓣上为数不少的恐怖/惊悚片的讨论中, 我们常常可以发现类似于这样的说法 [1]:</p><blockquote><p>像这部片子<br /> 也有吸引人看下去的地方<br /> 为什么分数总是那么低?</p></blockquote><p>那么, 豆瓣用户对这类影片的打分上, 是否真的存在普遍低于其他类型的影片的情况? 为了验证这个猜想, 我们不妨利用豆瓣提供的评分数据来简单分析一下.</p><h1>2 承</h1><p>首先明确问题的定义. 这里我们不去比较恐怖片和其他所有类型片的总体. 其实, 更让人感兴趣的问题是, 将恐怖片与其他同级别类型的影片分别进行两两比较, 结果会如何.</p><p>豆瓣的电影条目是采用tag来进行分类的. 此时样本的选取成了一个问题. 总的来说, 要保证各类型影片的类型特征区别要尽量大, 比如恐怖片和惊悚片之间的差别没有恐怖片和励志片的差别明显, 又如有可能一部影片既有"爱情"标签, 也有"喜剧"标签, 也就是说, 各类型的影片将存在交集. 同时, 也要保证各类样本在其他方面的差别尽量小, 如不同类型影片的总体规模差距不能过于悬殊等等.</p><p>我是这样做的.</p><p><span id="more-920"></span></p><h2>2.1 选取分类</h2><p>参考IMDB的<a href="http://www.imdb.com/genre" target="_blank">Genre分类</a>, 在<a href="http://movie.douban.com/tag/" target="_blank">豆瓣电影标签</a>页面选取8个比较主流的Genre, 标准是各类含有个数的差别不能过大, 因此剔除了动作/爱情/科幻等远超过100万部的类别. 同时, 各类之间的类型区别要明晰. 这8类和各类中的条目数分别为:</p><blockquote><p>纪录 (592406)<br /> 励志 (579749)<br /> 犯罪 (545789)<br /> 恐怖 (524562)<br /> 科幻 (1463871)<br /> 战争 (480216)<br /> 文艺 (471861)<br /> 动画短片 (432140)</p></blockquote><h2>2.2 提取条目链接</h2><p>各类按标注次数排序(<code>type=O</code>), 各类分别提取1600部(20部/页 x 80页)的条目链接:</p><div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p920code3'); return false;">View Code</a> RSPLUS</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p9203"><td class="line_numbers"><pre>1
</pre></td><td class="code" id="p920code3"><pre class="rsplus" style="font-family:monospace;"><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/require.html"><span style="color: #0000FF; font-weight: bold;">require</span></a><span style="color: #080;">&#40;</span>XML<span style="color: #080;">&#41;</span>
idParser <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/function.html"><span style="color: #0000FF; font-weight: bold;">function</span></a><span style="color: #080;">&#40;</span>genre, <span style="color: #0000FF; font-weight: bold;">start</span><span style="color: #080;">&#41;</span> <span style="color: #080;">&#123;</span>
	<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/url.html"><span style="color: #0000FF; font-weight: bold;">url</span></a> <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;http://movie.douban.com/tag/&quot;</span>, 
                    genre, <span style="color: #ff0000;">&quot;?start=&quot;</span>, <span style="color: #0000FF; font-weight: bold;">start</span>, <span style="color: #ff0000;">&quot;&amp;type=O&quot;</span>, sep <span style="color: #080;">=</span> <span style="color: #ff0000;">''</span><span style="color: #080;">&#41;</span>
	idPage <span style="color: #080;">=</span> htmlTreeParse<span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/url.html"><span style="color: #0000FF; font-weight: bold;">url</span></a>, useInternal <span style="color: #080;">=</span> TRUE<span style="color: #080;">&#41;</span>
	x <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/rep.html"><span style="color: #0000FF; font-weight: bold;">rep</span></a><span style="color: #080;">&#40;</span>NA, <span style="color: #ff0000;">20</span><span style="color: #080;">&#41;</span>
	<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/for.html"><span style="color: #0000FF; font-weight: bold;">for</span></a> <span style="color: #080;">&#40;</span>i <span style="color: #0000FF; font-weight: bold;">in</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/seq.html"><span style="color: #0000FF; font-weight: bold;">seq</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">22</span>, <span style="color: #ff0000;">61</span>, <span style="color: #ff0000;">2</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span> <span style="color: #080;">&#123;</span>
	x<span style="color: #080;">&#91;</span><span style="color: #080;">&#40;</span>i<span style="color: #080;">/</span><span style="color: #ff0000;">2</span><span style="color: #080;">-</span><span style="color: #ff0000;">10</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span> <span style="color: #080;">=</span> xpathApply<span style="color: #080;">&#40;</span>idPage, path <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;//a&quot;</span>, 
                                 xmlGetAttr, <span style="color: #ff0000;">&quot;href&quot;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #080;">&#91;</span>i<span style="color: #080;">&#93;</span><span style="color: #080;">&#93;</span>
	<span style="color: #080;">&#125;</span>
	<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/return.html"><span style="color: #0000FF; font-weight: bold;">return</span></a><span style="color: #080;">&#40;</span>x<span style="color: #080;">&#41;</span>
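<span style="color: #080;">&#125;</span></pre></td></tr></table></div><p>The number of raters is taken into account here: entries with too few ratings are not selected. Douban's scoring method, like IMDB's, does account for the number of raters, so an entry with few ratings will not end up with an extremely high or low score; still, if the goal is to reflect mainstream taste, picking films that few people have watched is risky. It is also easy to check that around the 1600th entry each genre still has roughly 1000+ taggings, so the number of raters is adequate and the sample's scores can be considered reasonably fair.</p><h2>2.3 Removing redundant entries</h2><p>The 8 genres are compared pairwise. First drop duplicate entries within a genre (they do exist!), then drop entries that belong to more than one genre. This leaves about 1100 - 1300 films per genre, with no overlap between genres.</p><h2>2.4 Sampling</h2><p>From the remaining sample, draw 1000 entries at random from each genre.</p><h2>2.5 Reading the ratings</h2><p>Use the Douban API [2] to read 1000 x 8 = 8000 ratings (average scores).</p><div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p920code4'); return false;">View Code</a> RSPLUS</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p9204"><td class="line_numbers"><pre>1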
</pre></td><td class="code" id="p920code4"><pre class="rsplus" style="font-family:monospace;">rateParser <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/function.html"><span style="color: #0000FF; font-weight: bold;">function</span></a><span style="color: #080;">&#40;</span>id<span style="color: #080;">&#41;</span> <span style="color: #080;">&#123;</span>
	<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/url.html"><span style="color: #0000FF; font-weight: bold;">url</span></a> <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;http://api.douban.com/movie/subject/&quot;</span>, id, 
				<span style="color: #ff0000;">&quot;?apikey={your api key}&quot;</span>, sep <span style="color: #080;">=</span> <span style="color: #ff0000;">''</span><span style="color: #080;">&#41;</span>
	moviePage <span style="color: #080;">=</span> htmlTreeParse<span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/url.html"><span style="color: #0000FF; font-weight: bold;">url</span></a>, useInternal <span style="color: #080;">=</span> TRUE<span style="color: #080;">&#41;</span>
	<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/return.html"><span style="color: #0000FF; font-weight: bold;">return</span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/as.numeric.html"><span style="color: #0000FF; font-weight: bold;">as.<span style="">numeric</span></span></a><span style="color: #080;">&#40;</span>xpathApply<span style="color: #080;">&#40;</span>moviePage, 
	path <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;//rating&quot;</span>, xmlGetAttr, <span style="color: #ff0000;">&quot;average&quot;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #080;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
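<span style="color: #080;">&#125;</span></pre></td></tr></table></div><p>When reading the data, note that even with a Douban API key there is a limit of 40 requests per minute. I used the simplest approach: put the request inside a loop and call <code>Sys.sleep(1.5)</code> after each one, which keeps the rate at or below 40 requests per minute. You can run <code>tail -f filename</code> in a shell to watch the output file being written in real time.</p><h1>3 Turn</h1><p>With data in hand, plot first.</p><h2>3.1 Scatter plot</h2><p>Start with a scatter plot as a warm-up (the variable is one-dimensional, so it can be called a strip plot) to see how the ratings are distributed in each genre. The ratings are given to one decimal place only, so there are many ties; I first add random jitter to the raw data to spread the points out, and use alpha blending to make the points semi-transparent, to reduce the problems caused by overlap.</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/09/stripplot.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/09/stripplot.png" alt="stripplot" title="stripplot" width="504" height="504" class="aligncenter size-full wp-image-921" /></a></p><p>The vast majority of documentaries average above 7, clearly unlike the other genres, while horror films scoring above 8 are rare. So if you spot a horror film above 8 on Douban, just grab it ... There is also one inspirational film with a score of 0 (in fact, with 0 raters) that still made the list by number of taggings. Which film is that odd? I will leave that untold ...</p><p>Because the sample is fairly large, overplotting is severe; the scatter plot is actually of limited use and yields little information.</p><h2>3.2 Kernel density estimation</h2><p>Plot kernel density estimates of the ratings.</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/09/densityplot.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/09/densityplot.png" alt="densityplot" title="densityplot" width="504" height="648" class="aligncenter size-full wp-image-922" /></a></p><p>The rating density in each panel is essentially unimodal, and the shapes are roughly alike. War and inspirational films look very similar, and so do documentaries and art-house films, except that the documentary peak sits at a slightly higher rating. So do not be surprised to see many documentaries rated 8-9 on Douban.</p><h2>3.3 Box plot</h2><p>Draw box plots.</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/09/bwplot.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/09/bwplot.png" alt="bwplot" title="bwplot" width="504" height="504" class="aligncenter size-full wp-image-923" /></a></p><p>Films tagged art-house/crime have the most stable rating distributions, with very few outliers, which may also suggest that films in these genres are fairly homogeneous. 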
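The median of horror films is indeed lower than that of the other genres, and high scores are scarce.</p><h2>3.4 Violin plot</h2><p>Draw violin plots.</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/09/violinplot.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/09/violinplot.png" alt="violinplot" title="violinplot" width="504" height="504" class="aligncenter size-full wp-image-924" /></a></p><p>A violin plot combines features of the box plot and the kernel density plot, but it does not work especially well here, so I will say no more.</p><h1>4 Conclusion</h1><p>Finally, the fun part: the statistical test. The exact form of the distributions is unknown and we should not assume one casually, so we use a nonparametric test on the medians.</p><p>Let<br /> u_1: the median rating of horror films<br /> u_2: the median rating of some other genre<br /> H_0: Douban users rate horror films no differently from the other genre, u_1 = u_2<br /> H_1: Douban users rate horror films lower than the other genre, u_1 &lt; u_2</p><p>We run the two-sample Wilcoxon (Mann-Whitney) test, whose only assumption is that the two population distributions have similar shapes. The kernel density plots and violin plots above show that the shapes are indeed roughly alike, so the requirement can be considered met.</p><p>The result?</p><p>All 7 tests rejected the null hypothesis.</p><p>Horror directors really do have it hard. On the one hand, a good horror film is difficult to make, and it is easy to get branded B-movie/cult. Well-known directors seldom make pure horror films; in particular, since low-budget thriller/horror films such as Saw appeared years ago, more and more small productions and first-time directors have tried their hand at thriller/horror in recent years. On the other hand, could it be that viewers who have just watched a really good horror film are scared into giving low scores? That is an interesting question.</p><h1>References</h1><p>[1] <a href="http://www.douban.com/group/topic/9434945/" target="_blank">Why are horror films generally rated so low on Douban</a>.<br /> [2] <a href="http://www.douban.com/service/apidoc/reference/" target="_blank">Douban API Reference Manual</a>.<br /> [3] Deepayan Sarkar. Lattice: Multivariate Data Visualization with R. pp. 37-52.<br /> [4] Wu Xizhi (吴喜之). Statistics: From Data to Conclusions (统计学: 从数据到结论). pp. 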
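266-269.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/statistics/douban_rating.html/feed</wfw:commentRss> <slash:comments>7</slash:comments> </item> <item><title>God of War: Ghost of Sparta</title><link>http://www.road2stat.com/cn/life/god_of_war.html</link> <comments>http://www.road2stat.com/cn/life/god_of_war.html#comments</comments> <pubDate>Sun, 04 Sep 2011 13:20:09 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[Bits of Life]]></category> <category><![CDATA[psp]]></category> <category><![CDATA[God of War]]></category> <category><![CDATA[Ghost of Sparta]]></category> <category><![CDATA[review]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=915</guid> <description><![CDATA[“Rather than stand by the water coveting the fish, better to watch the fire from the far bank; rather than sit and discourse on the Way, better to wage war on paper.” It is clear that Sony put real effort into this installment in order to surpass 2008's God of War: Chains of Olympus. Building a game like this on top of Greek mythology is quite an achievement. In particular, it ties real-world legends such as Atlantis and Midas into Kratos's story in a way that feels plausible. The work as a whole is worth the two-year wait: a first-rate title. The script is more convincing than its predecessor's. The theme evolves from revenge to redemption, and the plot around his brother Deimos makes heavy use of flashbacks: the conversation before his mother turns into a monster, the fight with the young Deimos after entering the city of Sparta, and the final brawl with the young Kratos in the Temple of Ares. Although the relationships between characters are often forced, the overall story outline is commendable. Combat difficulty is about the same as in the previous game. I hit no especially hard stage, and some scenes can be cleared almost in one go. Apart from that one dog fight and the destruction of the big gear, which took relatively long, fights everywhere else flow smoothly. The weapon system, though, remains unremarkable, old wine in a new bottle: it merely adds the fire of Thera (席拉火焰), which drains and recharges continuously, to the Blades of Athena, a rather naive design that is not cool enough. Item-based puzzles are far more down-to-earth than before: there are fewer “high-tech” items, and the puzzles are plainer yet force you to think. What stands out is the design of reusing a single item several times. The way forward and the locations of items are also better hidden. The links between items and the physics behind the various mechanisms show some innovation. Notably, there are clearly more scenes that require rope crossings and cliff climbing, which really puts the emphasis on the word “abyss”: with all the jumping around in mid-air, it starts to feel a bit like Prince of Persia/Assassin's Creed. As for environments, the spatial scale is much grander than before: up into the sky, down into the earth, plus a day trip to the sea floor, with ice and fire alternating and a strong sense of depth. Scene transitions are paced faster than in the previous game and continuity is better; there is almost none of the back-and-forth exploration inside a multi-level temple seen in the predecessor, although the several collapse-and-escape sequences are rather alike. In terms of color, there are cooler palettes, such as the mountains where you fight Erinys, daughter of the god of death, and the Temple of Thanatos in the Domain of Death, as well as warmer ones, such as the volcano and the city of Sparta. Overall the colors are comfortable, and even after finishing a campaign much longer than the predecessor's, the eyes do not feel too tired. The score is serviceable, with neither highlights nor obvious flaws. Kratos's voice acting, however, deserves criticism: it is hoarse and deep to the point of sounding affected. Of course, the essence of the whole God of War story lies in the spirit of defiance rooted deep in Kratos's heart, as in his exchange with an NPC: "I don't want to be god. The gods could take the honor back." 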
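Granted, Kratos keeps repeating this point, which comes across as a bit preachy. Still, what I have wanted to complain about all along is that the developers removed the save prompt from the previous game, “The gods have decided to give you a chance. Save?”, a wonderfully humorous line; that is a real pity.]]></description> <content:encoded><![CDATA[<p>“Rather than stand by the water coveting the fish, better to watch the fire from the far bank; rather than sit and discourse on the Way, better to wage war on paper.”</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/09/kratos.jpg"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/09/kratos.jpg" alt="kratos" title="kratos" width="500" height="281" class="aligncenter size-full wp-image-916" /></a></p><p>It is clear that Sony put real effort into this installment in order to surpass 2008's God of War: Chains of Olympus. Building a game like this on top of Greek mythology is quite an achievement. In particular, it ties real-world legends such as Atlantis and Midas into Kratos's story in a way that feels plausible. The work as a whole is worth the two-year wait: a first-rate title.</p><p>The script is more convincing than its predecessor's. The theme evolves from revenge to redemption, and the plot around his brother Deimos makes heavy use of flashbacks: the conversation before his mother turns into a monster, the fight with the young Deimos after entering the city of Sparta, and the final brawl with the young Kratos in the Temple of Ares. Although the relationships between characters are often forced, the overall story outline is commendable.</p><p>Combat difficulty is about the same as in the previous game. I hit no especially hard stage, and some scenes can be cleared almost in one go. Apart from that one dog fight and the destruction of the big gear, which took relatively long, fights everywhere else flow smoothly. The weapon system, though, remains unremarkable, old wine in a new bottle: it merely adds the fire of Thera (席拉火焰), which drains and recharges continuously, to the Blades of Athena, a rather naive design that is not cool enough.</p><p>Item-based puzzles are far more down-to-earth than before: there are fewer “high-tech” items, and the puzzles are plainer yet force you to think. What stands out is the design of reusing a single item several times. The way forward and the locations of items are also better hidden. The links between items and the physics behind the various mechanisms show some innovation. Notably, there are clearly more scenes that require rope crossings and cliff climbing, which really puts the emphasis on the word “abyss”: with all the jumping around in mid-air, it starts to feel a bit like Prince of Persia/Assassin's Creed.</p><p>As for environments, the spatial scale is much grander than before: up into the sky, down into the earth, plus a day trip to the sea floor, with ice and fire alternating and a strong sense of depth. Scene transitions are paced faster than in the previous game and continuity is better; there is almost none of the back-and-forth exploration inside a multi-level temple seen in the predecessor, although the several collapse-and-escape sequences are rather alike. In terms of color, there are cooler palettes, such as the mountains where you fight Erinys, daughter of the god of death, and the Temple of Thanatos in the Domain of Death, as well as warmer ones, such as the volcano and the city of Sparta. Overall the colors are comfortable, and even after finishing a campaign much longer than the predecessor's, the eyes do not feel too tired.</p><p>The score is serviceable, with neither highlights nor obvious flaws. Kratos's voice acting, however, deserves criticism: it is hoarse and deep to the point of sounding affected.</p><p>Of course, the essence of the whole God of War story lies in the spirit of defiance rooted deep in Kratos's heart, as in his exchange with an NPC: "I don't want to be god. The gods could take the honor back." 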
Granted, Kratos keeps repeating this point, which comes across as a bit preachy. Still, what I have wanted to complain about all along is that the developers removed the save prompt from the previous game, “The gods have decided to give you a chance. Save?”, a wonderfully humorous line; that is a real pity.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/life/god_of_war.html/feed</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Rapid Prototyping R based Web Applications with Rook: Visualizing CVE-2011-0611 samples with Self-Organizing Maps</title><link>http://www.road2stat.com/cn/r_language/rook.html</link> <comments>http://www.road2stat.com/cn/r_language/rook.html#comments</comments> <pubDate>Sun, 04 Sep 2011 09:47:44 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[R]]></category> <category><![CDATA[Binary]]></category> <category><![CDATA[Executable]]></category> <category><![CDATA[Rook]]></category> <category><![CDATA[Self-Organizing Maps]]></category> <category><![CDATA[SOM]]></category> <category><![CDATA[Visualization]]></category> <category><![CDATA[web application]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=899</guid> <description><![CDATA[Inspired by Ruby's Rack project, Jeffrey Horner released his R package "Rook" [1] earlier this year. After getting several Rook applications running, I realized that Rook avoids certain disadvantages of rApache. Rook is much more flexible &#8230; <a href="http://www.road2stat.com/cn/r_language/rook.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p>Inspired by Ruby's <a href="http://rack.rubyforge.org/" target="_blank">Rack</a> project, Jeffrey Horner released his R package "Rook" [1] earlier this year. After getting several Rook applications running, I realized that Rook avoids certain disadvantages of rApache. Rook is much more flexible and easier to learn.</p><p>In theory, once the proper plugin is written, your app could be deployed under any web server, such as apache/lighttpd/nginx. 
Another significant advantage of Rook is that it is easy to debug. Because Rook uses Rhttpd as its default server, you can preview your app on the fly, without any complicated deployment process.</p><p>Here is a test app that implements the binary file visualization method described in the VizSec and Virol papers [2] and [3]. We visualize the CVE-2011-0611 samples, which were retrieved from [4]. By also using the Rook::File application, we can serve static (png) files.</p><p>Load the required packages:</p><div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p899code8'); return false;">View Code</a> RSPLUS</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p8998"><td class="line_numbers"><pre>1
</pre></td><td class="code" id="p899code8"><pre class="rsplus" style="font-family:monospace;"><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/require.html"><span style="color: #0000FF; font-weight: bold;">require</span></a><span style="color: #080;">&#40;</span>Rook<span style="color: #080;">&#41;</span>
<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/require.html"><span style="color: #0000FF; font-weight: bold;">require</span></a><span style="color: #080;">&#40;</span>digest<span style="color: #080;">&#41;</span>
<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/require.html"><span style="color: #0000FF; font-weight: bold;">require</span></a><span style="color: #080;">&#40;</span>kohonen<span style="color: #080;">&#41;</span></pre></td></tr></table></div><p>Write a Rook app:</p><div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left2">Download <a href="http://www.road2stat.com/cn/wp-content/plugins/wp-codebox/wp-codebox.php?p=899&amp;download=visbin.R">visbin.R</a></span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p8999"><td class="line_numbers"><pre>1
</pre></td><td class="code" id="p899code9"><pre class="rsplus" style="font-family:monospace;">newapp <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/function.html"><span style="color: #0000FF; font-weight: bold;">function</span></a><span style="color: #080;">&#40;</span>env<span style="color: #080;">&#41;</span> <span style="color: #080;">&#123;</span>
    req <span style="color: #080;">=</span> Rook<span style="color: #080;">::</span><span style="">Request</span>$new<span style="color: #080;">&#40;</span>env<span style="color: #080;">&#41;</span>
    res <span style="color: #080;">=</span> Rook<span style="color: #080;">::</span><span style="">Response</span>$new<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'Choose a Binary file to Train:<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'&lt;form method=&quot;POST&quot; enctype=&quot;multipart/form-data&quot;&gt;<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'&lt;input type=&quot;file&quot; name=&quot;data&quot;&gt;<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'xdim:<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'&lt;input type=&quot;text&quot; name=&quot;xdim&quot; value=&quot;12&quot;&gt;<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'ydim:<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'&lt;input type=&quot;text&quot; name=&quot;ydim&quot; value=&quot;25&quot;&gt;<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'ncolors:<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'&lt;input type=&quot;text&quot; name=&quot;ncolors&quot; value=&quot;8&quot;&gt;<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><span style="color: #ff0000;">'&lt;input type=&quot;submit&quot; value=&quot;Go!&quot;&gt;<span style="color: #000099; font-weight: bold;">\n</span>&lt;/form&gt;<span style="color: #000099; font-weight: bold;">\n</span>&lt;br&gt;'</span><span style="color: #080;">&#41;</span>
&nbsp;
    myNormalize <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/function.html"><span style="color: #0000FF; font-weight: bold;">function</span></a> <span style="color: #080;">&#40;</span>target<span style="color: #080;">&#41;</span> <span style="color: #080;">&#123;</span>
    <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/return.html"><span style="color: #0000FF; font-weight: bold;">return</span></a><span style="color: #080;">&#40;</span><span style="color: #080;">&#40;</span>target <span style="color: #080;">-</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/min.html"><span style="color: #0000FF; font-weight: bold;">min</span></a><span style="color: #080;">&#40;</span>target<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span><span style="color: #080;">/</span><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/max.html"><span style="color: #0000FF; font-weight: bold;">max</span></a><span style="color: #080;">&#40;</span>target<span style="color: #080;">&#41;</span> <span style="color: #080;">-</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/min.html"><span style="color: #0000FF; font-weight: bold;">min</span></a><span style="color: #080;">&#40;</span>target<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
    <span style="color: #080;">&#125;</span>
&nbsp;
    <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/if.html"><span style="color: #0000FF; font-weight: bold;">if</span></a> <span style="color: #080;">&#40;</span><span style="color: #080;">!</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/is.null.html"><span style="color: #0000FF; font-weight: bold;">is.<span style="">null</span></span></a><span style="color: #080;">&#40;</span>req$POST<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span> <span style="color: #080;">&#123;</span>
    <span style="color: #0000FF; font-weight: bold;">data</span> <span style="color: #080;">=</span> req$POST<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #080;">&#91;</span><span style="color: #ff0000;">&quot;data&quot;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#93;</span>
    hash <span style="color: #080;">=</span> digest<span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">data</span>$tempfile, algo <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;md5&quot;</span>, <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/file.html"><span style="color: #0000FF; font-weight: bold;">file</span></a> <span style="color: #080;">=</span> TRUE<span style="color: #080;">&#41;</span>
    destFile <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/file.html"><span style="color: #0000FF; font-weight: bold;">file</span></a><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">data</span>$tempfile, <span style="color: #ff0000;">&quot;rb&quot;</span><span style="color: #080;">&#41;</span>
    k <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/floor.html"><span style="color: #0000FF; font-weight: bold;">floor</span></a><span style="color: #080;">&#40;</span><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/file.info.html"><span style="color: #0000FF; font-weight: bold;">file.<span style="">info</span></span></a><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">data</span>$tempfile<span style="color: #080;">&#41;</span>$size<span style="color: #080;">/</span><span style="color: #ff0000;">16</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span> <span style="color: #080;">-</span> <span style="color: #ff0000;">2</span>
    doneFile <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/readBin.html"><span style="color: #0000FF; font-weight: bold;">readBin</span></a><span style="color: #080;">&#40;</span>con <span style="color: #080;">=</span> destFile, what <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;raw&quot;</span>, n <span style="color: #080;">=</span> <span style="color: #ff0000;">2</span> <span style="color: #080;">*</span> <span style="color: #ff0000;">8</span> <span style="color: #080;">*</span> k<span style="color: #080;">&#41;</span>
    <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/close.html"><span style="color: #0000FF; font-weight: bold;">close</span></a><span style="color: #080;">&#40;</span>destFile<span style="color: #080;">&#41;</span>
    tmpFile0 <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/rbind.html"><span style="color: #0000FF; font-weight: bold;">rbind</span></a><span style="color: #080;">&#40;</span>doneFile<span style="color: #080;">&#91;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/seq.html"><span style="color: #0000FF; font-weight: bold;">seq</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">1</span>, <span style="color: #080;">&#40;</span><span style="color: #ff0000;">2</span> <span style="color: #080;">*</span> <span style="color: #ff0000;">8</span> <span style="color: #080;">*</span> k<span style="color: #080;">&#41;</span> <span style="color: #080;">-</span> <span style="color: #ff0000;">1</span>, <span style="color: #ff0000;">2</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span>, doneFile<span style="color: #080;">&#91;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/seq.html"><span style="color: #0000FF; font-weight: bold;">seq</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">2</span>, <span style="color: #080;">&#40;</span><span style="color: #ff0000;">2</span> <span style="color: #080;">*</span> <span style="color: #ff0000;">8</span> <span style="color: #080;">*</span> k<span style="color: #080;">&#41;</span>, <span style="color: #ff0000;">2</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#41;</span>
    tmpFile1 <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span>tmpFile0<span style="color: #080;">&#91;</span><span style="color: #ff0000;">1</span>, <span style="color: #080;">&#93;</span>, tmpFile0<span style="color: #080;">&#91;</span><span style="color: #ff0000;">2</span>, <span style="color: #080;">&#93;</span>, sep <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;&quot;</span><span style="color: #080;">&#41;</span>
    initMat <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/matrix.html"><span style="color: #0000FF; font-weight: bold;">matrix</span></a><span style="color: #080;">&#40;</span>strtoi<span style="color: #080;">&#40;</span>tmpFile1, 16L<span style="color: #080;">&#41;</span>, <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/ncol.html"><span style="color: #0000FF; font-weight: bold;">ncol</span></a> <span style="color: #080;">=</span> <span style="color: #ff0000;">8</span>, byrow <span style="color: #080;">=</span> TRUE<span style="color: #080;">&#41;</span>
    normMat <span style="color: #080;">=</span> myNormalize<span style="color: #080;">&#40;</span>initMat<span style="color: #080;">&#41;</span>
    trainedSOM <span style="color: #080;">=</span> kohonen<span style="color: #080;">::</span><span style="">som</span><span style="color: #080;">&#40;</span>normMat, <a href="http://astrostatistics.psu.edu/su07/R/html/stats/html/grid.html"><span style="color: #0000FF; font-weight: bold;">grid</span></a> <span style="color: #080;">=</span> somgrid<span style="color: #080;">&#40;</span>xdim <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/as.numeric.html"><span style="color: #0000FF; font-weight: bold;">as.<span style="">numeric</span></span></a><span style="color: #080;">&#40;</span>req$POST<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #080;">&#91;</span><span style="color: #ff0000;">&quot;xdim&quot;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#41;</span>, ydim <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/as.numeric.html"><span style="color: #0000FF; font-weight: bold;">as.<span style="">numeric</span></span></a><span style="color: #080;">&#40;</span>req$POST<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #080;">&#91;</span><span style="color: #ff0000;">&quot;ydim&quot;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#41;</span>, <span style="color: #ff0000;">&quot;hexagonal&quot;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
    <a href="http://astrostatistics.psu.edu/su07/R/html/stats/html/summary.lm.html"><span style="color: #0000FF; font-weight: bold;">png</span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;/tmp/&quot;</span>, hash, <span style="color: #ff0000;">&quot;.png&quot;</span>, sep <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;&quot;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
    <a href="http://astrostatistics.psu.edu/su07/R/html/stats/html/plot.html"><span style="color: #0000FF; font-weight: bold;">plot</span></a><span style="color: #080;">&#40;</span>trainedSOM, type <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;dist.neighbours&quot;</span>, palette.<span style="">name</span> <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/stats/html/summary.lm.html"><span style="color: #0000FF; font-weight: bold;">rainbow</span></a>, ncolors <span style="color: #080;">=</span> <a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/as.numeric.html"><span style="color: #0000FF; font-weight: bold;">as.<span style="">numeric</span></span></a><span style="color: #080;">&#40;</span>req$POST<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#91;</span><span style="color: #080;">&#91;</span><span style="color: #ff0000;">&quot;ncolors&quot;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#93;</span><span style="color: #080;">&#41;</span>, main <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;&quot;</span><span style="color: #080;">&#41;</span>
    <a href="http://astrostatistics.psu.edu/su07/R/html/stats/html/summary.lm.html"><span style="color: #0000FF; font-weight: bold;">dev.<span style="">off</span></span></a><span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span>
    res$write<span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;&lt;img src='&quot;</span>, s$full_url<span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;pic&quot;</span><span style="color: #080;">&#41;</span>, <span style="color: #ff0000;">&quot;/&quot;</span>, hash, <span style="color: #ff0000;">&quot;.png'&quot;</span>, <span style="color: #ff0000;">&quot; /&gt;&quot;</span>, sep <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;&quot;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
    <span style="color: #080;">&#125;</span>
    res$finish<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#125;</span></pre></td></tr></table></div><p>Initialize/Run the app:</p><div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p899code10'); return false;">View Code</a> RSPLUS</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p89910"><td class="line_numbers"><pre>1
2
3
4
5
</pre></td><td class="code" id="p899code10"><pre class="rsplus" style="font-family:monospace;">s <span style="color: #080;">=</span> Rhttpd$new<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span>
s$add<span style="color: #080;">&#40;</span>app <span style="color: #080;">=</span> newapp, name <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;visbin&quot;</span><span style="color: #080;">&#41;</span>
s$add<span style="color: #080;">&#40;</span>app <span style="color: #080;">=</span> File$new<span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;/tmp&quot;</span><span style="color: #080;">&#41;</span>, name <span style="color: #080;">=</span> <span style="color: #ff0000;">&quot;pic&quot;</span><span style="color: #080;">&#41;</span>
s$start<span style="color: #080;">&#40;</span><span style="color: #080;">&#41;</span>
s$browse<span style="color: #080;">&#40;</span><span style="color: #ff0000;">&quot;visbin&quot;</span><span style="color: #080;">&#41;</span></pre></td></tr></table></div><p>First, the app hashes the uploaded file, then trains a SOM model. Because the training result differs from run to run, we may train several times and keep the best result.</p><p>We use the U-Matrix to visualize the Self-Organizing Map. The U-Matrix value of a unit is the average distance between that unit and its closest neighbors, and color is used to represent the value. The number of colors in the palette is critical: too many or too few may interfere with the detection of potential cluster patterns.</p><p>There are many more methods for dimension reduction and visualization available in R packages; see the R News (R Journal) paper [5].</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/09/visbin.jpg"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/09/visbin.jpg" alt="visbin" title="visbin" width="445" height="623" class="aligncenter size-full wp-image-901" /></a></p><p>The plot clearly shows a cluster pattern in the lower right corner. It's reasonable to suspect that the file was injected with some data that shouldn't be there.</p><p>The paper reports poor results when visualizing macro viruses (embedded in Microsoft Office files). The CVE-2011-0611 samples are doc/xls files, but they are not macro viruses; they are host files injected with malicious Adobe SWF files. From this point of view, they are just like infected executable files, so the theory still applies.</p><p>One detail: after uploading, <code>data$tempfile</code> has a different MD5 from the original file; it gains the extra bytes 0D 0A (apparently a CRLF newline) at the end. I don't quite understand how this happens. As we deleted the last two lines of the file to form a proper matrix, the training data is not identical to the binary sample.
This does not affect the result in this case.</p><p>In summary, Rook connects the 3000+ available R packages with web application development. Just 40 lines of code were enough to achieve a not-so-simple goal, which is really amazing.</p><h1>References</h1><p>[1] <a href="http://cran.r-project.org/web/packages/Rook/index.html" target="_blank">Rook - a web server interface for R</a>.<br /> [2] Visualizing Windows Executable Viruses Using Self-Organizing Maps, VizSec, 2004.<br /> [3] Non-signature Based Virus Detection, Journal in Computer Virology, 2:163–186, 2006.<br /> [4] Contagio Malware Dump. <a href="http://contagiodump.blogspot.com/2011/04/apr-8-cve-2011-0611-flash-player-zero.html" target="_blank">Apr. 8 CVE-2011-0611 Flash Player Zero day - SWF in DOC/ XLS - Disentangling Industrial Policy</a>.<br /> [5] Dimensional Reduction for Data Mapping, R News, Vol. 3/3, 2003.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/r_language/rook.html/feed</wfw:commentRss> <slash:comments>3</slash:comments> </item> </channel> </rss>
