Visualizing CRAN Package Dependency Network: Reveal Hidden Patterns with Martin Krzywinski's Hive Panel

1 Introduction

Studying the networks of online software community is fascinating. CPAN Explorer is a typical project aiming at analyzing the relationships in CPAN community [1]. CRAN package dependency network is another excellent source for this type of research. A state-of-art visualization is usually required to understand the network [2].

A common problem of conventional hairball style network visualization is: the graph becomes uninterpretable when it meets very large networks [3]. Researchers developed techniques such as hierarchical edge bundles [4] to tackle this problem. However, that's just too ideal for real world visualization problems. When it's emphasizing the strong connections in the network, the less strong part and the key details could possibly be ignored. Conventional visualization methods have constrained us to take a further step: revealing more hidden information of the internal structure (vertices, connectivity, etc.) in the network.

2 Hive Plots

Martin Krzywinski, author of the circular style genome visualization tool circos, proposed the hive plots in 2010 [5]. The most significant difference between hive plots and traditional layout is: its graphic design is based on the network's meaningful properties (vertices' degree, connectivity, centrality, etc.) instead of aesthetics. This design makes the graph interpretable and thus simplifies the presentation of relational data.

3 The Visualization

We selected 27 representative packages and visualize every three of them in one hive plot to make a 3x3 hive panel. Each panel represents a specific research field. Each node of the network is mapped on the axes by its degree information: green axis represents out-degree, orange axis represents in-degree, and purple axis combines in/out-degrees together. On each axis, outer nodes have higher degrees. The white connections, as the background, show us the overall connectivity of the network: the nodes have higher out-degrees are heavily depended by all ranges of nodes in the network, and the brighter parts of the arcs tend to indicate potential cluster patterns.

hiveplot

Click here to see a larger version.

hivepanel

Click here to see a larger version.

Meanwhile, we highlight three of the interested packages in each research field in one panel with three different colors to reveal its specific connection patterns. For the first panel, green connections represents lattice package. It's a fundamental package for graphic design in R, which is heavily depended by packages of all degrees. The purple connections represent the rgl package. It depends a little but it's depended by much more packages that distributed more discretely on the orange axis than lattice was. Orange lines represent the gplots package, which contains various miscellaneous tools for plotting. Obviously, the dependency patterns indicate its different role between the previous ones: it's more of a handy toolset for plotting, rather than a core package. The upper right panel shows us three of the data import/export packages: DBI, RODBC and RSQLite. Amazingly, althought they play different roles in the whole community, their dependency patterns are almost the same, except for a little difference between their degrees. The central panel, which highlights the finance-related packages fBasics, fOptions, and fGarch, reveals similar features.

Hive plots are relatively much more informative and comprehensive than conventional hairball-style visualizations, especially for large networks. You could discover much more interesting patterns in other panels yourself with this visualization.

The selected packages (ordered by panel 11, 12, 13, 21, 22 …) are:

  • Graphics: lattice / rgl / gplots (Green / Purple / Orange)
  • Programming: tools / rJava / Rcpp
  • Data Import/Export: DBI / RODBC / RSQLite
  • GUI Dev Tools & Framework: tcltk / gWidgets / Rcmdr
  • Finance: fBasics / fOptions / fGarch
  • Machine Learning: e1071 / rpart / randomForest
  • Regression Analysis: car / leaps / quantreg
  • Spatial and Geo Statistics: sp / maps / fields
  • Time Series Analysis: forecast / timeDate / tseries

4 Details

The creation of this visualization is really simple; highly reproducible for anyone who has a little knowledge of SNA [6]:

  1. The original data was retrieved from
    http://cran.r-project.org/bin/windows/contrib/2.13/PACKAGES
    on September 14, 2011. We only extracted the 'Depends' section of each package. After parsing and a bit of cleaning, a network consisted of 2,500 vertices and 5,900 arcs was constructed.
  2. To shrink the network, perform k-core analysis and extract the 4-6 cores partition to form a new network, a denser one, with less noise. Now it's reduced to about 600 vertices and 2,500 arcs.
  3. Draw the shrinked network permuted by degree information with Martin's linnet tool. Each single panel implies a package's degree and dependency distribution properties. Combine the 9 separated hive plots to form a complete hive panel.

References

[1] Julian Bilcke. CPAN Explorer - An Interactive Exploration of the Perl Ecosystem. http://cpan-explorer.org/, 2009.
[2] Xiao Nan. R2S - PKU Vis Summer School. http://www.road2stat.com/cn/statistics/pku_vis_summer_school.html, 2010.
[3] Koon-Kiu Yana, Gang Fanga, Nitin Bhardwaja, Roger P. Alexandera, Mark Gerstein. Comparing Genomes to Computer Operating Systems in Terms of the Topology and Evolution of their Regulatory Control Networks. Proceedings of the National Academy of Sciences, 107 (20): 9186 - 9191, 2006.
[4] Danny Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Transactions on Visualization and Computer Graphics (TVCG; Proceedings of Vis/InfoVis 2006), Vol. 12, No. 5, 741 - 748, 2006.
[5] Martin Krzywinski. Hive Plots - Linear Layout for Network Visualization - Visually Interpreting Network Structure and Content Made Possible. http://www.hiveplot.com/, 2010.
[6] Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj. Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[7] J.R. Heard. World Economic Forum Hive Plot. http://www.visualizing.org/visualizations/world-economic-forum-hive-plot/, 2010.

Rapid Prototyping R based Web Applications with Rook: Visualizing CVE-2011-0611 samples with Self-Organizing Maps

Inspired by Ruby's Rack Project, Jeffery Horner released his R package "Rook" [1] earlier this year. After trying to get several Rook applications running, I realized that Rook had avoided some certain disadvantages of Rapache. Rook is much more flexible and easier to learn.

Theoretically speaking, once the proper plugin is done, your app could then be deployed under any web servers such as apache/lighthttpd/nginx, etc. Another significant advantage of Rook is, it's friendly for debugging. As Rook takes Rhttpd as the default server, you could preview your app on-the-fly, without any complicated deploying process.

Here's a test app, which implements the creative binary file visualization method described in the VizSec and Virol papers [2] and [3]. We choose to visualize the CVE-2011-0611 samples, which were retrieved from [4]. By using the Rook::File application simultaneously, we could serve static (png) files.

Load required pkgs:

?View Code RSPLUS
1
2
3
require(Rook)
require(digest)
require(kohonen)

Write a Rook app:

?Download visbin.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
newapp = function(env) {
    req = Rook::Request$new(env)
    res = Rook::Response$new()
    res$write('Choose a Binary file to Train:\n')
    res$write('<form method="POST" enctype="multipart/form-data">\n')
    res$write('<input type="file" name="data">\n')
    res$write('xdim:\n')
    res$write('<form method="POST">\n')
    res$write('<input type="text" name="xdim" value="12">\n')
    res$write('ydim:\n')
    res$write('<form method="POST">\n')
    res$write('<input type="text" name="ydim" value="25">\n')
    res$write('ncolors:\n')
    res$write('<form method="POST">\n')
    res$write('<input type="text" name="ncolors" value="8">\n')
    res$write('<input type="submit" name="Go!">\n</form>\n<br>')
 
    myNormalize = function (target) {
    return((target - min(target))/(max(target) - min(target)))
    }
 
    if (!is.null(req$POST())) {
    data = req$POST()[["data"]]
    hash = digest(data$tempfile, algo = "md5", file = TRUE)
    destFile = file(data$tempfile, "rb")
    k = floor((file.info(data$tempfile)$size/16)) - 2
    doneFile = readBin(con = destFile, what = "raw", n = 2 * 8 * k)
    close(destFile)
    tmpFile0 = rbind(doneFile[seq(1, (2 * 8 * k) - 1, 2)], doneFile[seq(2, (2 * 8 * k), 2)])
    tmpFile1 = paste(tmpFile0[1, ], tmpFile0[2, ], sep = "")
    initMat = matrix(strtoi(tmpFile1, 16L), ncol = 8, byrow = TRUE)
    normMat = myNormalize(initMat)
    trainedSOM = kohonen::som(normMat, grid = somgrid(xdim = req$POST()[["xdim"]], ydim = req$POST()[["ydim"]], "hexagonal"))
    png(paste("/tmp/", hash, ".png", sep = ""))
    plot(trainedSOM, type = "dist.neighbours", palette.name = rainbow, ncolors = as.numeric(req$POST()[["ncolors"]]), main = "")
    dev.off()
    res$write(paste("<img src='", s$full_url("pic"), "/", hash, ".png'", " />", sep = ""))
    }
    res$finish()
}

Initialize/Run the app:

?View Code RSPLUS
1
2
3
4
5
s = Rhttpd$new()
s$add(app = newapp, name = "visbin")
s$add(app = File$new("/tmp"), name = "pic")
s$start()
s$browse("visbin")

Firstly the app hashes the uploaded files then trains SOM models. As the training result differs each time, we may train more times to get the better one.

We use the U-Matrix to visualize the Self-Organizing Maps, The U-Matrix value of a particular unit is the average distance between the unit and its closest neighbors, then color was used to represent the value. Actually, the number of the color palette is critical, too much or too little may interfere the detection of potential cluster patterns.

There exists much more methods for dimensional reduction and visualization with R packages, you may refer to the R News (R Journal) paper [5].

visbin

It clearly shows that a cluster pattern appears in the lower right corner. It's reasonable to suspect the file was injected with some data that shouldn't be there.

The paper says it got bad results when visualizing macro viruses (embedded in Microsoft Office files). Actually, the CVE-2011-0611 sample are doc/xls files, but they are not macro viruses. They're hosts injected with harmful Adobe swf files. From this point of view, they're just like the infected executable files. So theory still applies.

A detail is, after uploading, the data$tempfile has a different MD5 with the original file, it gains extra hex 0D 0A (seems a new line) in the end. I don't quite understand how this happens. As we had deleted the last two lines of the file to form a proper matrix, the training data is not identical with the binary sample. Nothing influences for this case.

In summary, Rook connects the 3000+ available R package and web application development, just 40 lines of code were used to achieve a not-so-simple goal, it's really amazing.

References

[1] Rook - a web server interface for R.
[2] Visualizing Windows Executable Viruses Using Self-Organizing Maps, VizSec, 2004.
[3] Non-signature Based Virus Detection, Journal in Computer Virology, 2:163–186, 2006.
[4] Contagio Malware Dump. Apr. 8 CVE-2011-0611 Flash Player Zero day - SWF in DOC/ XLS - Disentangling Industrial Policy.
[5] Dimensional Reduction for Data Mapping, R News, Vol. 3/3, 2003.

In god we trust, or in love?

You don't have to believe in God, but you should believe in The Book.

        ——P. Erdos

即日启程, 在接下来几天没有网络的日子里, 我深信亲情能够温暖人心. 临行前突然又想写点东西. 标题的前半句出自《圣经》, 后半句与本文有点关系, 似乎还是王小帅一部电影的英文名. 中间还少了一环, 这环就是数学. 大神P.Erdos[1] 不太相信上帝, 但他相信世界上有一本超穷的“天书”(The Book), 那里包含了所有数学定理最简洁、最漂亮、最优雅的证明. 他对一个证明的最高赞誉就是:

It is from The Book!

按照God→Math→Love的逻辑, 爱亦应有自己的数学表述. 有了数学表述自然要可视化一下. 笛卡尔心形线令人心碎的爱情故事不知其真实程度, 不过确实够浪漫的. 不过这隐函数都学完了还画心形线就太out了, 今天咱也浪漫一把, 画个隐函数方程生成的三维心. 不知此图能否有幸入选The Book的图形部分?

MATLAB_3D_heart

图1: MATLAB生成三维心形

原始方程:

MATLAB_3D_heart_formula
自己画一个?

?View Code SCILAB
1
2
3
4
5
[x,y,z]=meshgrid(linspace(-3,3,120));
f=(x.^2+(9*y.^2)./4+z.^2-1).^3-((9*y.^2).*(z.^3))./80-(x.^2).*(z.^3);
p=patch(isosurface(x,y,z,f,0));
set(p, 'FaceColor', 'r', 'EdgeColor', 'n');
daspect([1 1 1]);view(3);camlight('right');lighting phong

[1] P.Erdos (1913-1996), 当代最伟大的数学家之一, 他一生中同485位合作者发表了1475篇数学论文. Erdos的研究领域主要是数论和组合数学, 但他的论文中涵盖的学科有逼近论、初等几何、集合论、概率论、数理逻辑、格与序代数结构、线性代数、群论、拓扑群、多项式、测度论、单复变函数、差分方程与函数方程、数列、Fourier分析、泛函分析、一般拓扑、代数拓扑、统计、数值分析、计算机科学、信息论等等. 美国数学学会(AMS)的《数学评论》杂志曾把数学划分为约六十个分支, Erdos的论文涉及了约40%.

斯人已逝, 思想长存.

2010年春节前

雪夜哈尔滨