A Conjecture on Douban's Rating Calculation Strategy


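1 Introduction

In the September post [1], we got a brief look at one aspect of Douban movie ratings. In fact, we are also quite interested in the rating calculation rule itself. Here we take Douban Movies as an example for a simple conjecture and analysis; the same applies to music and books. The word "strategy" in the title is used in contrast to "mechanism"; what it refers to is actually fairly concrete.

2 Single Items

Some users have suggested that the rating of a single item is just a simple weighted average over the number of raters at each star level. Since the rating displayed on the page has a maximum of 10 while there are only 5 star levels to choose from, each star is worth 2 points, and the formula for a single item's rating is:

Rating = (10 x 5-star proportion) + (8 x 4-star proportion) + ... + (2 x 1-star proportion)

Manually checking this hypothesis against a sample of items shows that it does hold.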
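The weighted-average hypothesis can be sketched in a few lines of Python. This is only an illustration of the formula as stated above, not Douban's actual implementation; the function name `douban_score` and the star counts in `example` are made up for the purpose.

```python
# Points for each star level: 2 points per star, as the hypothesis assumes.
STAR_POINTS = {5: 10, 4: 8, 3: 6, 2: 4, 1: 2}


def douban_score(star_counts):
    """Weighted average of star levels on the 10-point scale.

    star_counts maps a star level (1-5) to the number of users who chose it.
    """
    total = sum(star_counts.values())
    if total == 0:
        raise ValueError("no ratings")
    return sum(STAR_POINTS[star] * count / total
               for star, count in star_counts.items())


# Hypothetical counts, for illustration only (not real Douban data).
example = {5: 400, 4: 350, 3: 200, 2: 40, 1: 10}
print(round(douban_score(example), 1))  # rounded to one decimal, as shown on the page
```

Checking an item by hand then amounts to comparing this value with the score displayed on its page.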

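However, there is a pitfall here: because of the peculiarities of rating data and the limits of sampling, if we take a subset of the data and run a regression, the result may be affected by the sample and deviate from the manual check. Since 1-star (very poor) and 2-star (poor) ratings usually make up a very small proportion in a large sample of items, an ordinary regression tends very easily to shrink the coefficients of X1, X2, and X3. For example, take the first 400 items from the 443 movies I have watched as a sample [rateSample.csv] and run a regression.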


Ten Typical Symptoms of Potential Academic Paranoia

Prof.Frink

  1. Is used to writing articles that begin with a section named 'Introduction' or end with a section named 'Conclusions'.
  2. Always cites several references in essays of any type or length; strongly believes that without citations, the work will not be recognized by anybody.
  3. Hates magazines with huge pictures and imprecise text; has a special fondness for two-column, small-font, tightly set dissertations with formulas, three-line tables, and stylish scalable graphics made of dots and lines.
  4. Uses a reference manager, instead of regular tools such as Google Calendar, to organize daily life.
  5. Has blogged about academic topics constantly for 2.5+ years, or has set up a stand-alone blog about current research.
  6. Has talked about academic matters in 50%+ of Twitter/Facebook statuses over the last 2 years, or has social accounts used purely for academic purposes.
  7. Has had at least one horrible nightmare about a B+ ruining perfect straight As, just like Lisa Simpson did.
  8. On encountering some data from the middle of nowhere, always wonders what its underlying patterns look like; imagines constructing a quantitative model for it, very seriously.
  9. On seeing a problem, cannot help diving into scholarly databases to retrieve related papers, reading the references thoroughly and digging recursively; gigabytes of papers end up stored on the hard drive.
  10. Blogs about academic paranoia and doesn't feel anything, until now.

Visualizing CRAN Package Dependency Network: Reveal Hidden Patterns with Martin Krzywinski's Hive Panel

1 Introduction

Studying the networks of online software communities is fascinating. CPAN Explorer is a typical project that analyzes the relationships within the CPAN community [1]. The CRAN package dependency network is another excellent source for this type of research. A state-of-the-art visualization is usually required to understand such a network [2].

A common problem with conventional hairball-style network visualization is that the graph becomes uninterpretable for very large networks [3]. Researchers have developed techniques such as hierarchical edge bundles [4] to tackle this problem. However, such techniques are too idealized for real-world visualization problems: by emphasizing the strong connections in the network, they can obscure the weaker parts and key details. Conventional visualization methods have kept us from taking a further step: revealing more hidden information about the internal structure (vertices, connectivity, etc.) of the network.

2 Hive Plots

Martin Krzywinski, author of the circular genome visualization tool Circos, proposed hive plots in 2010 [5]. The most significant difference between hive plots and traditional layouts is that their graphic design is based on the network's meaningful properties (vertex degree, connectivity, centrality, etc.) instead of aesthetics. This design makes the graph interpretable and thus simplifies the presentation of relational data.

3 The Visualization

We selected 27 representative packages and visualized three of them in each hive plot to make a 3x3 hive panel. Each panel represents a specific research field. Each node of the network is mapped onto the axes by its degree: the green axis represents out-degree, the orange axis represents in-degree, and the purple axis combines in- and out-degree. On each axis, outer nodes have higher degrees. The white connections in the background show the overall connectivity of the network: the nodes with higher out-degrees are heavily depended on by nodes of all degree ranges, and the brighter parts of the arcs tend to indicate potential cluster patterns.
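The degree-based axis mapping can be sketched as follows. This is a minimal illustration, not the code behind the figures: it assumes each arc points from a package to a package that depends on it (consistent with high out-degree meaning "heavily depended on"), and it assumes positions scale linearly with degree, which the post does not state. The function name `degree_axes` and the toy arcs are hypothetical.

```python
from collections import Counter


def degree_axes(arcs):
    """Place every node on three axes, each position scaled to [0, 1].

    arcs is a list of (source, target) pairs. Larger values lie further
    out on the axis. Linear scaling by the maximum degree is an assumption.
    """
    out_deg = Counter(src for src, _ in arcs)
    in_deg = Counter(dst for _, dst in arcs)
    nodes = set(out_deg) | set(in_deg)
    all_deg = {n: out_deg[n] + in_deg[n] for n in nodes}

    def scale(deg):
        top = max((deg[n] for n in nodes), default=0)
        return {n: (deg[n] / top if top else 0.0) for n in nodes}

    out_pos, in_pos, all_pos = scale(out_deg), scale(in_deg), scale(all_deg)
    return {n: {"out": out_pos[n], "in": in_pos[n], "all": all_pos[n]}
            for n in nodes}


# Toy network with made-up package names, for illustration only.
toy_arcs = [("core", "a"), ("core", "b"), ("core", "c"), ("a", "c")]
positions = degree_axes(toy_arcs)
print(positions["core"])
```

Here "out", "in", and "all" correspond to the green, orange, and purple axes described above.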

[Figure: hive plot]

[Figure: hive panel]

Meanwhile, in each panel we highlight three packages of interest from one research field in three different colors to reveal their specific connection patterns. In the first panel, the green connections represent the lattice package, a fundamental package for graphics in R that is heavily depended on by packages of all degrees. The purple connections represent the rgl package. It depends on few packages but is depended on by many more, and those packages are spread more widely along the orange axis than lattice's. The orange lines represent the gplots package, which contains various miscellaneous plotting tools. Its dependency pattern indicates a different role from the previous two: it is more of a handy plotting toolset than a core package. The upper-right panel shows three data import/export packages: DBI, RODBC, and RSQLite. Remarkably, although they play different roles in the community, their dependency patterns are almost the same, apart from small differences in their degrees. The central panel, which highlights the finance-related packages fBasics, fOptions, and fGarch, reveals similar features.

Hive plots are much more informative and comprehensive than conventional hairball-style visualizations, especially for large networks. You can discover more interesting patterns in the other panels yourself with this visualization.

The selected packages (ordered by panel 11, 12, 13, 21, 22 …) are:

  • Graphics: lattice / rgl / gplots (Green / Purple / Orange)
  • Programming: tools / rJava / Rcpp
  • Data Import/Export: DBI / RODBC / RSQLite
  • GUI Dev Tools & Framework: tcltk / gWidgets / Rcmdr
  • Finance: fBasics / fOptions / fGarch
  • Machine Learning: e1071 / rpart / randomForest
  • Regression Analysis: car / leaps / quantreg
  • Spatial and Geo Statistics: sp / maps / fields
  • Time Series Analysis: forecast / timeDate / tseries

4 Details

Creating this visualization is simple and highly reproducible for anyone with a little knowledge of social network analysis (SNA) [6]:

  1. The original data were retrieved from
    http://cran.r-project.org/bin/windows/contrib/2.13/PACKAGES
    on September 14, 2011. We extracted only the 'Depends' field of each package. After parsing and a bit of cleaning, we constructed a network consisting of 2,500 vertices and 5,900 arcs.
  2. To shrink the network, perform k-core analysis and extract the 4-6 core partition to form a new, denser network with less noise. This reduces it to about 600 vertices and 2,500 arcs.
  3. Draw the shrunken network, permuted by degree, with Martin's linnet tool. Each single panel conveys a package's degree and dependency distribution. Combine the 9 separate hive plots to form the complete hive panel.
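Steps 1 and 2 can be sketched in Python as below. This is not the original workflow (the reference to [6] suggests it was done with an SNA tool), only a rough equivalent under stated assumptions: the PACKAGES file is treated as blank-line-separated "Field: value" records with indented continuation lines; arcs point from a dependency to the package that depends on it; k-cores are computed ignoring arc direction; and "4-6 cores" is read as "vertices whose core number lies between 4 and 6". The function names and the `SAMPLE` records are hypothetical.

```python
import re
from collections import defaultdict


def parse_depends(packages_text):
    """Map each package name to the packages listed in its Depends field."""
    depends = {}
    for record in re.split(r"\n\s*\n", packages_text.strip()):
        # Continuation lines start with whitespace; fold them into one line.
        unfolded = re.sub(r"\n[ \t]+", " ", record)
        fields = dict(line.split(":", 1)
                      for line in unfolded.splitlines() if ":" in line)
        name = fields.get("Package", "").strip()
        if not name:
            continue
        deps = []
        for item in fields.get("Depends", "").split(","):
            dep = re.sub(r"\(.*?\)", "", item).strip()  # drop version constraints
            if dep and dep != "R":  # "R" itself is not a package
                deps.append(dep)
        depends[name] = deps
    return depends


def core_numbers(arcs):
    """Core number of every vertex, ignoring arc direction (an assumption)."""
    neighbours = defaultdict(set)
    for a, b in arcs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    degree = {v: len(ns) for v, ns in neighbours.items()}
    core, k, remaining = {}, 0, set(degree)
    while remaining:
        # Peel off the vertex with the smallest remaining degree.
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in neighbours[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def core_slice(arcs, low, high):
    """Keep only arcs whose endpoints both have core number in [low, high]."""
    core = core_numbers(arcs)
    keep = {v for v, c in core.items() if low <= c <= high}
    return [(a, b) for a, b in arcs if a in keep and b in keep]


# Made-up records in the assumed file layout, for illustration only.
SAMPLE = """\
Package: pkgA
Version: 1.0
Depends: R (>= 2.10), pkgB,
        pkgC (>= 0.2)

Package: pkgB
Version: 0.3
Depends: pkgC

Package: pkgC
Version: 0.2
"""

depends = parse_depends(SAMPLE)
arcs = [(dep, pkg) for pkg, deps in depends.items() for dep in deps]
print(core_numbers(arcs))
```

On the real data, `core_slice(arcs, 4, 6)` would correspond to the shrinking in step 2; whether it reproduces the reported 600 vertices and 2,500 arcs depends on the assumptions listed above.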

References

[1] Julian Bilcke. CPAN Explorer - An Interactive Exploration of the Perl Ecosystem. http://cpan-explorer.org/, 2009.
[2] Xiao Nan. R2S - PKU Vis Summer School. http://www.road2stat.com/cn/statistics/pku_vis_summer_school.html, 2010.
[3] Koon-Kiu Yan, Gang Fang, Nitin Bhardwaj, Roger P. Alexander, Mark Gerstein. Comparing Genomes to Computer Operating Systems in Terms of the Topology and Evolution of their Regulatory Control Networks. Proceedings of the National Academy of Sciences, 107 (20): 9186 - 9191, 2010.
[4] Danny Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Transactions on Visualization and Computer Graphics (TVCG; Proceedings of Vis/InfoVis 2006), Vol. 12, No. 5, 741 - 748, 2006.
[5] Martin Krzywinski. Hive Plots - Linear Layout for Network Visualization - Visually Interpreting Network Structure and Content Made Possible. http://www.hiveplot.com/, 2010.
[6] Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj. Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[7] J.R. Heard. World Economic Forum Hive Plot. http://www.visualizing.org/visualizations/world-economic-forum-hive-plot/, 2010.

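Do Douban Users Really Rate Different Film Genres with a Bias?

1 Opening

In quite a few Douban discussions of horror/thriller films, we often find remarks like this [1]:

A film like this one
also has things that make you want to keep watching
so why is the score always so low?

So, do Douban users really tend to rate this kind of film lower than other genres? To test this conjecture, we might as well run a simple analysis on the rating data Douban provides.

2 Development

First, let us define the problem clearly. Here we do not compare horror films against the totality of all other genres. What is actually more interesting is what happens when horror films are compared pairwise with each of the other genres at the same level.

Douban classifies its movie items by tag, so sample selection becomes a problem. In general, the genres should differ from each other as much as possible in their defining characteristics: for example, the difference between horror films and thrillers is less obvious than that between horror films and inspirational films; likewise, a film may carry both a "romance" tag and a "comedy" tag, which means the genres will overlap. At the same time, the samples should differ as little as possible in other respects; for instance, the overall sizes of the different genres should not be too disparate.

Here is how I did it.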


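Human-Powered Learning for Search Engines

In the scorching summer, posting plenty of idle chatter is good for body and mind.

Blekko's logo seems to have been upgraded a while ago from beta to beta^2, and a game called "3 engine monte" has appeared at the bottom center of the page. To play, enter "keyword /Monte" and the search results from Blekko, Google, and Bing are displayed side by side with their formatting stripped; the user guesses from experience which result comes from Blekko, and clicking on it reveals whether the guess was right:

[Figure: blekko_game]

Here I searched with R as the keyword. Although each of the three results is only one page long, each has its merits. Google's results clearly lean technical; Bing's results are a mixed bag and not very reliable. Blekko's results show fairly obvious signs of human curation: they are mostly the home pages of relatively authoritative sources.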


Graphics of Large Datasets: Chapter 11 of the Original Book


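This is Chapter 11 of Graphics of Large Datasets: Visualizing a Million, from Springer's Statistics and Computing series. It covers the visualization and analysis of the dataset provided for the contest track of the InfoVis 2005 conference. One of the authors, Martin Theus, is also the author of iPlots, an interactive graphics package for R, and JGR, a Java GUI front end for R.

The reason for singling out this chapter is that related books and materials generally lean toward theory, with examples that are scattered and unsystematic, and often lack a complete analysis of a real dataset. Material this realistic and detailed is relatively hard to come by.

I did part of it half a year ago, tidied it up over the holiday in the last couple of days, and put it up. The figures were extracted from the electronic edition of the book, and I tried my best to preserve the look of the original in content and layout.

It is worth mentioning that one of the two teams that placed 1st in the contest was from IAState: Heike Hofmann, Hadley Wickham, Dianne Cook, Junjie Sun, and Christian Röttger, and Heike Hofmann is one of the three authors of this book. The team of this chapter's two authors placed 2nd.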

PDF, 3.3 MB