Visualizing the CRAN Package Dependency Network: Revealing Hidden Patterns with Martin Krzywinski's Hive Panel

1 Introduction

Studying the networks of online software communities is fascinating. CPAN Explorer is a typical project that analyzes the relationships within the CPAN community [1]. The CRAN package dependency network is another excellent source for this type of research. A state-of-the-art visualization is usually required to understand such a network [2].

A common problem with conventional hairball-style network visualization is that the graph becomes uninterpretable for very large networks [3]. Researchers have developed techniques such as hierarchical edge bundles [4] to tackle this problem. However, such techniques are too idealized for real-world visualization problems: by emphasizing the strong connections in the network, they can hide the weaker connections and key details. Conventional visualization methods have kept us from taking a further step: revealing more hidden information about the internal structure of the network (vertices, connectivity, etc.).

2 Hive Plots

Martin Krzywinski, author of the circular genome visualization tool Circos, proposed hive plots in 2010 [5]. The most significant difference between hive plots and traditional layouts is that their graphic design is based on meaningful properties of the network (vertex degree, connectivity, centrality, etc.) rather than on aesthetics. This design makes the graph interpretable and thus simplifies the presentation of relational data.

3 The Visualization

We selected 27 representative packages and visualized three of them in each hive plot, giving a 3x3 hive panel. Each panel represents a specific research field. Each node of the network is placed on the axes according to its degree: the green axis represents out-degree, the orange axis represents in-degree, and the purple axis combines in-degree and out-degree. On each axis, nodes farther out have higher degrees. The white connections in the background show the overall connectivity of the network: nodes with higher out-degrees are heavily depended upon by nodes across the whole degree range, and the brighter parts of the arcs tend to indicate potential cluster patterns.
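The degree-to-axis mapping described above can be illustrated with a minimal sketch in plain Python. It is not the code behind the figures. It assumes that an arc points from a package to a package that depends on it (so out-degree counts dependents, as in the text), that positions are scaled linearly to the range 0 to 1, and that the package names are made-up placeholders.

```python
from collections import defaultdict

def axis_positions(arcs):
    """Map each node to a position in [0, 1] on the out, in, and total axes."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for source, target in arcs:
        out_deg[source] += 1
        in_deg[target] += 1
    nodes = set(out_deg) | set(in_deg)
    total = {n: out_deg.get(n, 0) + in_deg.get(n, 0) for n in nodes}

    def scale(deg):
        # Linear scaling is an assumption; a rank-based scale would also work.
        top = max((deg.get(n, 0) for n in nodes), default=0)
        return {n: (deg.get(n, 0) / top if top else 0.0) for n in nodes}

    out_s, in_s, total_s = scale(out_deg), scale(in_deg), scale(total)
    return {
        n: {"out": out_s[n], "in": in_s[n], "total": total_s[n]}
        for n in sorted(nodes)
    }

# Hypothetical toy network: (package, package that depends on it).
arcs = [
    ("base", "core"),
    ("base", "plotA"),
    ("core", "plotA"),
    ("core", "plotB"),
    ("core", "plotC"),
]
positions = axis_positions(arcs)
for name, pos in positions.items():
    print(name, pos)
```

In this toy network, "core" sits at the outer end of the out-degree axis because it has the most dependents, while "plotA" sits at the outer end of the in-degree axis.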

hiveplot

hivepanel

Meanwhile, in each panel we highlight three packages of interest from one research field in three different colors to reveal their specific connection patterns. In the first panel, the green connections represent the lattice package, a fundamental package for graphics in R that is heavily depended upon by packages of all degrees. The purple connections represent the rgl package. It depends on few packages, but many more packages depend on it, and these are spread more discretely along the orange axis than those of lattice. The orange lines represent the gplots package, which contains various miscellaneous plotting tools. Its dependency pattern indicates a different role from the previous two: it is more of a handy plotting toolset than a core package. The upper right panel shows three data import/export packages: DBI, RODBC, and RSQLite. Remarkably, although they play different roles in the community, their dependency patterns are almost the same, apart from small differences in their degrees. The central panel, which highlights the finance-related packages fBasics, fOptions, and fGarch, reveals similar features.

Hive plots are considerably more informative and comprehensive than conventional hairball-style visualizations, especially for large networks. You can discover more interesting patterns in the other panels yourself.

The selected packages (ordered by panel 11, 12, 13, 21, 22 …) are:

  • Graphics: lattice / rgl / gplots (Green / Purple / Orange)
  • Programming: tools / rJava / Rcpp
  • Data Import/Export: DBI / RODBC / RSQLite
  • GUI Dev Tools & Framework: tcltk / gWidgets / Rcmdr
  • Finance: fBasics / fOptions / fGarch
  • Machine Learning: e1071 / rpart / randomForest
  • Regression Analysis: car / leaps / quantreg
  • Spatial and Geo Statistics: sp / maps / fields
  • Time Series Analysis: forecast / timeDate / tseries

4 Details

Creating this visualization is simple and highly reproducible for anyone with a little knowledge of social network analysis (SNA) [6]:

  1. The original data were retrieved from
    http://cran.r-project.org/bin/windows/contrib/2.13/PACKAGES
    on September 14, 2011. We extracted only the 'Depends' field of each package. After parsing and a bit of cleaning, we constructed a network of 2,500 vertices and 5,900 arcs.
  2. To shrink the network, perform a k-core analysis and extract the 4-6 cores partition to form a new, denser network with less noise. This reduces it to about 600 vertices and 2,500 arcs.
  3. Draw the reduced network, with nodes arranged by degree, using Martin Krzywinski's linnet tool. Each single panel shows a package's degree and dependency distribution properties. Combine the nine separate hive plots to form the complete hive panel.
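Steps 1 and 2 above can be sketched roughly in plain Python. This is an illustration under stated assumptions, not the original workflow: the helper names (`parse_depends`, `core_numbers`, `core_band`) are hypothetical, the "cleaning" is assumed to mean dropping version constraints and the dependency on R itself, and the k-core computation is assumed to ignore arc direction and duplicate edges. The sample text uses made-up package names in the same field layout as the PACKAGES file.

```python
import re
from collections import defaultdict

def parse_depends(text):
    """Return (package, dependency) pairs from the 'Depends' fields."""
    pairs = []
    for record in re.split(r"\n\s*\n", text.strip()):
        # Continuation lines start with whitespace; fold them onto their field.
        unfolded = re.sub(r"\n[ \t]+", " ", record)
        fields = dict(
            line.split(":", 1) for line in unfolded.splitlines() if ":" in line
        )
        package = fields.get("Package", "").strip()
        for item in fields.get("Depends", "").split(","):
            name = re.sub(r"\(.*?\)", "", item).strip()  # drop "(>= x.y)"
            if package and name and name != "R":  # assumed cleaning rule
                pairs.append((package, name))
    return pairs

def core_numbers(pairs):
    """Core number of each vertex, ignoring direction and duplicate edges."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    degree = {v: len(ns) for v, ns in neighbours.items()}
    core, k = {}, 0
    while degree:
        v = min(degree, key=degree.get)  # peel the lowest-degree vertex
        k = max(k, degree[v])
        core[v] = k
        del degree[v]
        for u in neighbours[v]:
            if u in degree:
                degree[u] -= 1
    return core

def core_band(pairs, low, high):
    """Keep arcs whose endpoints both have a core number in [low, high]."""
    core = core_numbers(pairs)
    return [(a, b) for a, b in pairs
            if low <= core.get(a, 0) <= high and low <= core.get(b, 0) <= high]

# Hypothetical sample in the PACKAGES field layout.
SAMPLE = """\
Package: a
Depends: R (>= 2.10), b,
        c

Package: b
Depends: c

Package: d
Depends: a
"""

pairs = parse_depends(SAMPLE)
cores = core_numbers(pairs)
dense = core_band(pairs, 2, 2)
print(pairs, cores, dense)
```

On the real data, a call such as `core_band(pairs, 4, 6)` would correspond to the 4-6 cores partition mentioned in step 2, under the assumptions listed above.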

References

[1] Julian Bilcke. CPAN Explorer - An Interactive Exploration of the Perl Ecosystem. http://cpan-explorer.org/, 2009.
[2] Xiao Nan. R2S - PKU Vis Summer School. http://www.road2stat.com/cn/statistics/pku_vis_summer_school.html, 2010.
[3] Koon-Kiu Yan, Gang Fang, Nitin Bhardwaj, Roger P. Alexander, Mark Gerstein. Comparing Genomes to Computer Operating Systems in Terms of the Topology and Evolution of their Regulatory Control Networks. Proceedings of the National Academy of Sciences, 107 (20): 9186 - 9191, 2010.
[4] Danny Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Transactions on Visualization and Computer Graphics (TVCG; Proceedings of Vis/InfoVis 2006), Vol. 12, No. 5, 741 - 748, 2006.
[5] Martin Krzywinski. Hive Plots - Linear Layout for Network Visualization - Visually Interpreting Network Structure and Content Made Possible. http://www.hiveplot.com/, 2010.
[6] Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj. Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[7] J.R. Heard. World Economic Forum Hive Plot. http://www.visualizing.org/visualizations/world-economic-forum-hive-plot/, 2010.

Mapping CRAN Mirrors using R

CRAN_mirrors_map
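Yesterday I suddenly wanted to see how to draw maps with the maps package, so I put together a dataset of the geographic locations of the CRAN mirrors and tried it out. The city location data come mainly from the world.cities dataset in the maps package. The plots use two packages, maps and tripack: maps renders the map, and tripack computes and draws the Voronoi diagram and Delaunay triangulation for a given set of point coordinates. Compared with mainstream implementations in C or C++, or even Python, doing this kind of thing in R is extremely simple. It also shows how extensible R is. That said, there do not seem to be many computational geometry packages for R at the moment; as far as I know, the others are rcdd (an R interface to cddlib) and geometry. The geometry package, for instance, can compute Delaunay triangulations in n dimensions, among other things. It would be great if some expert someday wrote an R interface to CGAL, the most powerful computational geometry algorithms library.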

CRAN_Mirrors_Voronoi

Voronoi Diagram of CRAN Mirrors on World Map [PDF(Vector), 120KB]

CRAN_Mirrors_Delaunay

Delaunay Triangulation of CRAN Mirrors on World Map [PDF(Vector), 120KB]

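Judging from the maps, most CRAN mirrors are located in coastal regions, with few inland, and the distribution is very uneven. Western Europe is densely packed with about 30 mirrors, while the vast inland area to its east, several times its size, has only 3 or 4. The situation in Africa and South America is much the same. In North America the mirrors are spread fairly evenly; at least the distribution looks stable, with neither too many nor too few.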

Dataset & R Code [Gzip, 1,932 bytes]

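Two issues have to be considered:

  1. The Earth is an ellipsoid, so drawing a Delaunay triangulation and a Voronoi diagram on its surface requires computing a convex hull in three-dimensional space. However, since there are in fact no CRAN mirrors in the middle of the Pacific, which quite naturally keeps the existing mirrors far apart, I ignored this.
  2. The influence of a CRAN mirror is tied more closely to how optical cables are laid in each region, so a geographic world map is not quite up to the job here. A "map" that reflects the distribution of the network would be the right approach.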
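To illustrate the first issue, here is a small sketch in plain Python (not the R code linked above). It does not compute a convex hull; it only shows the defining property of a Voronoi cell, namely that a point belongs to the cell of its nearest site, and how "nearest" changes when longitude/latitude are treated as flat coordinates. It assumes a perfect sphere rather than an ellipsoid, and the two sites are hypothetical points on the equator, not real mirror locations.

```python
import math

def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def central_angle(p, q):
    """Great-circle angle in degrees between two (lat, lon) points."""
    a, b = to_unit_vector(*p), to_unit_vector(*q)
    dot = sum(x * y for x, y in zip(a, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

def planar_distance(p, q):
    """Distance when (lat, lon) are treated as flat map coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest(sites, query, distance):
    """Name of the site whose Voronoi cell contains the query point."""
    return min(sites, key=lambda name: distance(sites[name], query))

# Hypothetical sites on either side of the 180 degree meridian.
sites = {"site_150E": (0.0, 150.0), "site_170W": (0.0, -170.0)}
query = (0.0, 175.0)

on_sphere = nearest(sites, query, central_angle)
on_flat_map = nearest(sites, query, planar_distance)
print(on_sphere, on_flat_map)
```

The two answers differ for this query: on the sphere the point is 15 degrees from the site at 170W, but on the flat map that site appears 345 degrees away. This is the kind of error that a gap with no sites, such as the mid-Pacific, makes harmless in practice.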