Visualizing the CRAN Package Dependency Network: Revealing Hidden Patterns with Martin Krzywinski's Hive Panel

1 Introduction

Studying the networks of online software communities is fascinating. CPAN Explorer is a typical project that analyzes the relationships in the CPAN community [1]. The CRAN package dependency network is another excellent source for this type of research. A state-of-the-art visualization is usually required to understand such a network [2].

A common problem with conventional hairball-style network visualization is that the graph becomes uninterpretable for very large networks [3]. Researchers have developed techniques such as hierarchical edge bundles [4] to tackle this problem. However, these techniques are too idealized for real-world visualization problems: by emphasizing the strong connections in the network, they can hide the weaker connections and key details. Conventional visualization methods thus keep us from taking a further step: revealing more of the hidden information about the network's internal structure (vertices, connectivity, etc.).

2 Hive Plots

Martin Krzywinski, author of the circular genome visualization tool Circos, proposed hive plots in 2010 [5]. The most significant difference between hive plots and traditional layouts is that the graphic design is based on meaningful properties of the network (vertex degree, connectivity, centrality, etc.) instead of aesthetics. This design makes the graph interpretable and thus simplifies the presentation of relational data.

3 The Visualization

We selected 27 representative packages and visualized three of them in each hive plot to make a 3x3 hive panel. Each plot in the panel represents a specific research field. Each node of the network is placed on the axes according to its degree: the green axis represents out-degree, the orange axis represents in-degree, and the purple axis combines in- and out-degree. On each axis, nodes farther out have higher degrees. The white connections in the background show the overall connectivity of the network: nodes with higher out-degrees are heavily depended upon by nodes across the whole range of degrees, and the brighter parts of the arcs tend to indicate potential cluster patterns.
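As a rough illustration of the axis mapping described above, the following R sketch computes out-, in-, and combined degree for a toy edge list and scales each to a position in [0, 1], where 1 is the outer end of an axis. The edge list, the node names p1 to p4, the helper names, the arc direction, and the linear scaling are all assumptions made for illustration; the actual plot may use a different scaling (ranks, for example).

```r
# Toy edge list (hypothetical data). Assumption: an arc runs from the
# depended-upon package to the package that depends on it, so a high
# out-degree means "heavily depended upon".
edges <- data.frame(
  from = c("p1", "p1", "p1", "p2", "p3"),
  to   = c("p2", "p3", "p4", "p4", "p4"),
  stringsAsFactors = FALSE
)
nodes <- sort(unique(c(edges$from, edges$to)))

# Count how often each node appears in a vector of arc endpoints.
count_by_node <- function(x, nodes) {
  setNames(tabulate(match(x, nodes), nbins = length(nodes)), nodes)
}

out_deg <- count_by_node(edges$from, nodes)
in_deg  <- count_by_node(edges$to, nodes)

# Linear scaling to [0, 1]; 1 is the outer end of the axis.
axis_position <- function(deg) {
  if (max(deg) == 0) deg * 0 else deg / max(deg)
}

pos <- list(
  green_out    = axis_position(out_deg),
  orange_in    = axis_position(in_deg),
  purple_total = axis_position(out_deg + in_deg)
)
print(pos)
```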

hiveplot

hivepanel

Meanwhile, in each panel we highlight three packages of interest from one research field in three different colors to reveal their specific connection patterns. In the first panel, the green connections represent the lattice package, a fundamental graphics package in R that is heavily depended upon by packages of all degrees. The purple connections represent the rgl package. It depends on few packages but is depended upon by many more, and those packages are distributed more discretely along the orange axis than lattice's are. The orange lines represent the gplots package, which contains various miscellaneous plotting tools. Its dependency pattern points to a role different from the previous two: it is more of a handy plotting toolset than a core package. The upper right panel shows three data import/export packages: DBI, RODBC, and RSQLite. Remarkably, although they play different roles in the community, their dependency patterns are almost the same, apart from small differences in their degrees. The central panel, which highlights the finance-related packages fBasics, fOptions, and fGarch, reveals similar features.

Hive plots are considerably more informative than conventional hairball-style visualizations, especially for large networks. You can discover more interesting patterns in the other panels yourself with this visualization.

The selected packages (ordered by panel 11, 12, 13, 21, 22 …) are:

  • Graphics: lattice / rgl / gplots (Green / Purple / Orange)
  • Programming: tools / rJava / Rcpp
  • Data Import/Export: DBI / RODBC / RSQLite
  • GUI Dev Tools & Framework: tcltk / gWidgets / Rcmdr
  • Finance: fBasics / fOptions / fGarch
  • Machine Learning: e1071 / rpart / randomForest
  • Regression Analysis: car / leaps / quantreg
  • Spatial and Geo Statistics: sp / maps / fields
  • Time Series Analysis: forecast / timeDate / tseries

4 Details

Creating this visualization is simple and highly reproducible for anyone with a little knowledge of social network analysis (SNA) [6]:

  1. The original data was retrieved from
    http://cran.r-project.org/bin/windows/contrib/2.13/PACKAGES
    on September 14, 2011. We extracted only the 'Depends' field of each package. After parsing and a bit of cleaning, we constructed a network consisting of 2,500 vertices and 5,900 arcs.
  2. To shrink the network, perform a k-core analysis and extract the 4-6 core partition to form a new, denser network with less noise. This reduces it to about 600 vertices and 2,500 arcs.
  3. Draw the reduced network, ordered by degree, with Martin's linnet tool. Each single plot shows a package's degree and dependency distribution. Combine the 9 separate hive plots to form the complete hive panel.
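The first two steps above could be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not the original workflow: the inline DCF text is a hypothetical stand-in for the real PACKAGES file, the helper names `parse_depends` and `k_core` are invented for illustration, arcs are assumed to run from the depended-upon package to the dependent one, and the k-core is computed on total (in plus out) degree. It keeps every node of core number k or higher, which is simpler than extracting a 4-6 band.

```r
# Hypothetical stand-in for the PACKAGES file (DCF format). For the real
# file, pass the file path or a connection to read.dcf() instead.
dcf_text <- c(
  "Package: A", "Depends: R (>= 2.10), B, C", "",
  "Package: B", "Depends: C (>= 1.0)", "",
  "Package: C", "",
  "Package: D", "Depends: A, B, C", "",
  "Package: E", "Depends: D"
)
con <- textConnection(dcf_text)
pkgs <- read.dcf(con, fields = c("Package", "Depends"))
close(con)

# Split a Depends field into package names, dropping version
# constraints and the dependency on R itself.
parse_depends <- function(x) {
  if (is.na(x)) return(character(0))
  deps <- strsplit(x, ",", fixed = TRUE)[[1]]
  deps <- trimws(gsub("\\([^)]*\\)", "", deps))
  deps[deps != "" & deps != "R"]
}

dep_list <- lapply(pkgs[, "Depends"], parse_depends)
edges <- data.frame(
  from = unlist(dep_list),                          # depended-upon package
  to   = rep(pkgs[, "Package"], lengths(dep_list)), # dependent package
  stringsAsFactors = FALSE
)

# Repeatedly remove nodes whose total degree is below k until none remain.
k_core <- function(edges, k) {
  repeat {
    deg <- table(c(edges$from, edges$to))
    drop <- names(deg)[as.vector(deg) < k]
    if (length(drop) == 0) break
    keep <- !(edges$from %in% drop | edges$to %in% drop)
    edges <- edges[keep, , drop = FALSE]
  }
  edges
}

core2 <- k_core(edges, 2)
print(nrow(edges))
print(nrow(core2))
```

In this toy network, package E hangs off D by a single arc, so the 2-core drops it and keeps the densely connected packages A to D.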

References

[1] Julian Bilcke. CPAN Explorer - An Interactive Exploration of the Perl Ecosystem. http://cpan-explorer.org/, 2009.
[2] Xiao Nan. R2S - PKU Vis Summer School. http://www.road2stat.com/cn/statistics/pku_vis_summer_school.html, 2010.
[3] Koon-Kiu Yan, Gang Fang, Nitin Bhardwaj, Roger P. Alexander, Mark Gerstein. Comparing Genomes to Computer Operating Systems in Terms of the Topology and Evolution of their Regulatory Control Networks. Proceedings of the National Academy of Sciences, 107 (20): 9186 - 9191, 2010.
[4] Danny Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Transactions on Visualization and Computer Graphics (TVCG; Proceedings of Vis/InfoVis 2006), Vol. 12, No. 5, 741 - 748, 2006.
[5] Martin Krzywinski. Hive Plots - Linear Layout for Network Visualization - Visually Interpreting Network Structure and Content Made Possible. http://www.hiveplot.com/, 2010.
[6] Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj. Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[7] J.R. Heard. World Economic Forum Hive Plot. http://www.visualizing.org/visualizations/world-economic-forum-hive-plot/, 2010.

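Do Douban Users Really Rate Different Film Genres with a Bias?

1 Opening

In quite a few Douban discussions of horror and thriller films, we often come across remarks like this one [1]:

A film like this one
has things that keep you watching, too
so why are the ratings always so low?

So, do Douban users really tend to rate films of this kind lower than films of other genres? To test this conjecture, we may as well run a simple analysis of the rating data that Douban provides.

2 Development

First, let us define the problem clearly. We will not compare horror films against the aggregate of all other genres. The more interesting question is what happens when horror films are compared pairwise with each of the other genres at the same level.

Douban classifies its film entries by tag, so choosing the samples becomes a problem. Overall, the genres should differ from one another as much as possible in their defining characteristics: for example, the difference between horror films and thrillers is less pronounced than that between horror films and inspirational films. Likewise, a single film may carry both the "romance" tag and the "comedy" tag, which means that the genres will overlap. At the same time, the samples should differ as little as possible in other respects; for instance, the total sizes of the genres should not be too far apart.

Here is what I did.

Continue reading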

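God of War: Ghost of Sparta

"Rather than stand by the abyss coveting the fish, better to watch the fire from the far bank; rather than sit and expound on the Way, better to wage war on paper."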

kratos

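It is clear that Sony put real effort into this installment in order to surpass 2008's God of War: Chains of Olympus. Building a game like this on the foundation of Greek mythology is a fairly impressive achievement. In particular, it ties real-world legends such as Atlantis and Midas into Kratos's story, and the imagining is quite plausible. The work as a whole is worthy of the two-year wait: a top-class title.

As for the script, it is more convincing than its predecessor's. The theme evolves from revenge to redemption, and the plot concerning the younger brother Deimos in particular uses a great many flashbacks: the conversation with his mother before she turns into a monster, the fight with the young Deimos after entering the city of Sparta, and the final brawl with the young Kratos in the Temple of Ares. Although the character relationships are forced in quite a few places, the overall story outline is still commendable.

Combat difficulty is roughly on par with the previous game. I did not run into any especially hard stages, and some scenes could be cleared almost in one go. Apart from that one dog fight, where destroying the large gear took a relatively long time, the battles in the other scenes were all smooth. The weapon system, however, remains unremarkable, old wine in a new bottle: it merely adds to the Blades of Athena a Thera's fire that continuously depletes and recharges, a rather naive design that is not cool enough.

The item-based puzzles are far more down-to-earth than in the previous game: there are not so many "high-tech" items; instead they are plainer and force you to think. What left a fairly deep impression was the design of using a single item multiple times. Meanwhile, the routes forward and the item locations are better hidden. The way items link together and the physical principles behind the various mechanisms show some innovation. It is worth mentioning that scenes requiring rope traversal and cliff climbing are noticeably more frequent in this game, which can be said to emphasize the word "abyss": with all the jumping around in mid-air, it has to be said that it starts to resemble Prince of Persia or Assassin's Creed.

As for the environments, the spatial scale is much grander than in the previous game: up into the sky, down into the earth, plus a day trip to the sea floor, with ice and fire alternating and a very strong sense of layering. Scene transitions are faster paced than in the last game and continuity is improved; there is almost none of the back-and-forth exploration inside a multi-level temple seen in the previous game, although the several collapse-and-escape sequences are rather similar to one another. In terms of color, there are cooler tones in the mountains where you fight Erinys, daughter of the god of death, and in the temple of Thanatos in the Domain of Death, and warmer tones in the volcano and the city of Sparta. Overall the use of color is very comfortable; even after finishing a playthrough much longer than the previous game's, the eyes do not feel too tired.

The soundtrack is unremarkable, with neither highlights nor obvious flaws. Kratos's voice acting, however, deserves criticism: it is hoarse and deep to the point of sounding affected.

Of course, the essence of the whole God of War story lies in the spirit of defiance rooted deep in Kratos's heart, as in his conversation with an NPC: "I don't want to be god. The gods could take the honor back." Kratos keeps stressing this point, which comes across as somewhat preachy. Still, what I have always wanted to complain about is that the developers removed the humorous line shown when saving in the previous game, "The gods have decided to give you one chance. Save?", which is nothing short of a pity.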

Rapid Prototyping R-Based Web Applications with Rook: Visualizing CVE-2011-0611 Samples with Self-Organizing Maps

Inspired by Ruby's Rack project, Jeffrey Horner released his R package "Rook" [1] earlier this year. After getting several Rook applications running, I realized that Rook avoids some of the disadvantages of rApache: it is much more flexible and easier to learn.

In theory, once the proper plugin is available, your app could be deployed under any web server, such as Apache, lighttpd, or nginx. Another significant advantage of Rook is that it is friendly to debugging. Because Rook uses Rhttpd as its default server, you can preview your app on the fly, without any complicated deployment process.

Here is a test app that implements the binary file visualization method described in the VizSec and Journal in Computer Virology papers [2, 3]. We visualize the CVE-2011-0611 samples, which were retrieved from [4]. By running the Rook::File application alongside it, we can also serve the static (png) files.

Load the required packages:

require(Rook)
require(digest)
require(kohonen)

Write a Rook app:

newapp = function(env) {
    req = Rook::Request$new(env)
    res = Rook::Response$new()

    # Upload form: all inputs belong to a single multipart form.
    res$write('Choose a binary file to train on:\n')
    res$write('<form method="POST" enctype="multipart/form-data">\n')
    res$write('<input type="file" name="data">\n')
    res$write('xdim:\n')
    res$write('<input type="text" name="xdim" value="12">\n')
    res$write('ydim:\n')
    res$write('<input type="text" name="ydim" value="25">\n')
    res$write('ncolors:\n')
    res$write('<input type="text" name="ncolors" value="8">\n')
    res$write('<input type="submit" value="Go!">\n</form>\n<br>')

    # Min-max scaling to [0, 1].
    myNormalize = function(target) {
        (target - min(target)) / (max(target) - min(target))
    }

    post = req$POST()
    if (!is.null(post)) {
        data = post[["data"]]
        hash = digest(data$tempfile, algo = "md5", file = TRUE)
        # Read whole 16-byte rows, leaving out the last two rows of the file.
        k = floor(file.info(data$tempfile)$size / 16) - 2
        destFile = file(data$tempfile, "rb")
        doneFile = readBin(con = destFile, what = "raw", n = 2 * 8 * k)
        close(destFile)
        # Pair consecutive bytes into 4-digit hex strings (16-bit words).
        tmpFile0 = rbind(doneFile[seq(1, (2 * 8 * k) - 1, 2)], doneFile[seq(2, 2 * 8 * k, 2)])
        tmpFile1 = paste(tmpFile0[1, ], tmpFile0[2, ], sep = "")
        initMat = matrix(strtoi(tmpFile1, 16L), ncol = 8, byrow = TRUE)
        normMat = myNormalize(initMat)
        # Form fields arrive as strings, so convert them before building the grid.
        grid = somgrid(xdim = as.integer(post[["xdim"]]), ydim = as.integer(post[["ydim"]]), topo = "hexagonal")
        trainedSOM = kohonen::som(normMat, grid = grid)
        png(paste("/tmp/", hash, ".png", sep = ""))
        plot(trainedSOM, type = "dist.neighbours", palette.name = rainbow, ncolors = as.numeric(post[["ncolors"]]), main = "")
        dev.off()
        # `s` is the Rhttpd server object created when the app is started.
        res$write(paste("<img src='", s$full_url("pic"), "/", hash, ".png'", " />", sep = ""))
    }
    res$finish()
}

Initialize/Run the app:

s = Rhttpd$new()
s$add(app = newapp, name = "visbin")
s$add(app = File$new("/tmp"), name = "pic")
s$start()
s$browse("visbin")

The app first hashes the uploaded file and then trains a SOM model. Because the training result differs on each run, we may train several times and keep the best result.
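The preprocessing inside the app can be tried in isolation. The following sketch, using only base R, turns a raw byte vector into a matrix of 16-bit words (8 words per row, the first byte of each pair being the high byte) and min-max normalizes it. The function names `bytes_to_matrix` and `min_max` and the sample bytes are assumptions for illustration. Unlike the app, which also leaves out the last two rows, this sketch drops only a trailing partial row.

```r
# Convert raw bytes to a matrix of 16-bit words, 8 words (16 bytes) per row.
bytes_to_matrix <- function(bytes, words_per_row = 8L) {
  stopifnot(is.raw(bytes))
  bytes_per_row <- 2L * words_per_row
  n_rows <- length(bytes) %/% bytes_per_row
  bytes <- bytes[seq_len(n_rows * bytes_per_row)]  # drop a trailing partial row
  hi <- as.integer(bytes[c(TRUE, FALSE)])          # 1st, 3rd, 5th, ... byte
  lo <- as.integer(bytes[c(FALSE, TRUE)])          # 2nd, 4th, 6th, ... byte
  matrix(hi * 256L + lo, ncol = words_per_row, byrow = TRUE)
}

# Min-max scaling to [0, 1], as in the app's myNormalize().
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

sample_bytes <- as.raw(0:39)  # 40 hypothetical bytes: 2 full rows, 8 left over
m <- bytes_to_matrix(sample_bytes)
norm_m <- min_max(m)
print(dim(m))
print(m[1, 1:3])
```

Each row of the resulting matrix is one training vector for the SOM, so a file of n bytes yields roughly n/16 observations with 8 features each.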

We use the U-Matrix to visualize the self-organizing map. The U-Matrix value of a particular unit is the average distance between that unit and its closest neighbors, and color is used to represent the value. The number of colors in the palette is critical: too many or too few may interfere with the detection of potential cluster patterns.

Many more methods for dimension reduction and visualization are available in R packages; see the R News (R Journal) paper [5].
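To make the U-Matrix definition concrete, here is a minimal base R sketch that computes it for a small codebook. It assumes a rectangular grid with 4-neighborhoods for simplicity, whereas the app above trains a hexagonal grid; the function name `u_matrix` and the toy codebook values are hypothetical.

```r
# U-Matrix for a rectangular SOM grid.
# codebook: array with dim c(xdim, ydim, p), one p-dimensional
# codebook vector per grid unit.
u_matrix <- function(codebook) {
  xdim <- dim(codebook)[1]
  ydim <- dim(codebook)[2]
  offsets <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
  u <- matrix(NA_real_, xdim, ydim)
  for (i in seq_len(xdim)) {
    for (j in seq_len(ydim)) {
      d <- numeric(0)
      for (r in seq_len(nrow(offsets))) {
        ni <- i + offsets[r, 1]
        nj <- j + offsets[r, 2]
        if (ni >= 1 && ni <= xdim && nj >= 1 && nj <= ydim) {
          # Euclidean distance to this neighbor's codebook vector.
          d <- c(d, sqrt(sum((codebook[i, j, ] - codebook[ni, nj, ])^2)))
        }
      }
      u[i, j] <- mean(d)  # average distance to the existing neighbors
    }
  }
  u
}

# Toy 2 x 2 grid with one-dimensional codebook vectors.
codebook <- array(c(0, 1, 3, 6), dim = c(2, 2, 1))
u <- u_matrix(codebook)
print(u)
```

Units with large values sit on boundaries between clusters, while regions of small values indicate units whose codebook vectors are close together.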

visbin

It clearly shows that a cluster pattern appears in the lower right corner. It's reasonable to suspect the file was injected with some data that shouldn't be there.

The paper reports poor results when visualizing macro viruses (embedded in Microsoft Office files). The CVE-2011-0611 samples are doc/xls files, but they are not macro viruses: they are host files injected with harmful Adobe swf files. From this point of view they are just like infected executable files, so the theory still applies.

One detail: after uploading, data$tempfile has a different MD5 from the original file, because it gains an extra 0D 0A (apparently a newline) at the end. I don't quite understand how this happens. Since we leave out the last two rows of the file to form a proper matrix, the training data is not identical to the binary sample anyway, so this has no effect in this case.

In summary, Rook connects the 3,000+ available R packages with web application development. Only about 40 lines of code were needed to achieve a not-so-simple goal, which is really amazing.

References

[1] Rook - a web server interface for R.
[2] Visualizing Windows Executable Viruses Using Self-Organizing Maps, VizSec, 2004.
[3] Non-signature Based Virus Detection, Journal in Computer Virology, 2:163–186, 2006.
[4] Contagio Malware Dump. Apr. 8 CVE-2011-0611 Flash Player Zero day - SWF in DOC/ XLS - Disentangling Industrial Policy.
[5] Dimensional Reduction for Data Mapping, R News, Vol. 3/3, 2003.

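Learning the XML Package with Douban's Basic API

A long time ago I registered an RDouban project on R-Forge, hoping to do something fun with the API that Douban provides. Unfortunately I only ever wrote the beginning; anyone interested is free to take it over, no strings attached. Here, using Douban's basic API, I will very briefly cover the basics of reading data with the XML package.

1 XPath

Spend ten minutes learning XPath syntax.
Once you are comfortable with it, you can extract paths directly with debugging tools such as Firebug. Also pay particular attention to XML namespaces. (Thanks to yixuan for the reminder.)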
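The namespace issue can be illustrated with a short sketch using the XML package. The document below is a made-up Atom-style feed, not a real Douban response, and the `db` namespace URI in it is an assumption. The point is that elements in a default namespace are matched only when the XPath expression uses a prefix bound to that namespace.

```r
library(XML)

# Hypothetical Atom-style feed; element names and the db namespace URI
# are assumptions for illustration only.
xml_text <- '<feed xmlns="http://www.w3.org/2005/Atom" xmlns:db="http://www.douban.com/xmlns/">
  <entry>
    <title>Example A</title>
    <db:attribute name="year">1995</db:attribute>
  </entry>
  <entry>
    <title>Example B</title>
    <db:attribute name="year">2010</db:attribute>
  </entry>
</feed>'

doc <- xmlParse(xml_text, asText = TRUE)

# Bind prefixes to namespace URIs. The prefix "a" is arbitrary; it stands
# for the default (Atom) namespace, which has no prefix in the document.
ns <- c(a = "http://www.w3.org/2005/Atom", db = "http://www.douban.com/xmlns/")

titles <- xpathSApply(doc, "//a:entry/a:title", xmlValue, namespaces = ns)
years  <- xpathSApply(doc, "//db:attribute[@name='year']", xmlValue, namespaces = ns)
print(titles)
print(years)
```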

2 Douban API

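Spend n minutes reading the "Douban API Reference Manual".
Interactive features such as users' reviews, collections, broadcasts, and Douban mail usually require OAuth authentication first; reading RFC 5849 is recommended for a full understanding of the OAuth protocol. The ROAuth package can now handle this part, but it has little to do with reading data, so it is omitted here.

Continue reading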

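1995: Ancient Times

For someone with my level of awareness, 1995 counts as ancient times. For these people and their past, is 1995 not ancient times as well? Sixteen years have slipped by, and for some of us it has been nothing but a game, a dream.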

1995

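In 1995, Bi Fujian, who had already been working in Beijing for more than five years, still saw no progress in his career. At the beginning of the year he hurried back to his hometown in Liaoning and told his father, who was paralyzed and bedridden, that he was going on a long trip and could not come home for the Spring Festival. His father asked no further questions.

In 1995, after a youth men's basketball game held in Suixi, Anhui, a player went to the small shop beside the court to buy a drink. A little boy followed him asking for an autograph. The young man, who until then had no idea what an "autograph" was, carefully wrote the two characters "姚明" (Yao Ming) on the paper, so neatly that they looked almost printed.

In 1995, Shi Yuzhu, who had raised more than 100 million yuan solely through fundraising and pre-selling units off-plan, broke into the top ten of the Forbes mainland China rich list. In high spirits, in front of the Zhuhai construction site that stayed lit all night, he unrolled the blueprints of the Giant Building one last time and revised the planned number of floors to 70, on a par with the Bank of China Tower; with the spire on top, the total height reached 300 meters.

In 1995, Nicholas Tse, son of Patrick Tse, transferred to Hong Kong International School. He hit it off with a classmate surnamed Chen, and the two were close; he sometimes even stayed overnight at the classmate's home.

In 1995, far away on the other side of the ocean, Monica Lewinsky gathered at the White House with 200 young interns, ready to begin a six-week unpaid internship. Since most of the interns would be earning degrees in political science or related fields, Monica could only joke that in a place as frenzied as Washington, her psychology degree would actually come in handier.

In 1995, Netscape, which had never turned a profit, went public on NASDAQ. The investment banks had expected the shares to fetch only about 14 dollars each. On that same day, Larry Page, a graduate of the University of Michigan, was deciding which school to attend for further study, and Stanford University assigned a student named Sergey Brin to show him around campus; outside a middle school not far from Wall Street, 11-year-old middle schooler Mark Zuckerberg bought a yellow-covered copy of "C++ for Dummies" and began teaching himself to program.

Meanwhile, in Edinburgh, in the UK, a doctor back from a trip saw a consultation slip left during his absence and immediately phoned the patient, fortunately pulling her back from the brink of suicide. She was newly divorced, a single mother and a down-and-out writer. To save electricity, she would sometimes order just a single cup of coffee and spend the whole day writing in a café.

That British patient was J. K. Rowling.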