R连接PostgreSQL

最近一直在玩DICE三年前的神作《镜之边缘》, 顺便重温了一下一年前的《黑手党II》, 玩得简直是没有什么时间上来灌水了. 游戏之余偶然接触了一个PostgreSQL数据库, 简单记录一下.
mirrorsedge

R连接数据库有几套方案, 其实基本上就是DBI/ODBC/JDBC. 不过话说ODBC和JDBC神马的真是弱爆了. JDBC方案中那个鬼魂一般的依赖rJava, 真的是很难安装. 其实也有一种可能是AUR上的JDK打包得不好, 没能hold住R CMD javareconf的标准. 前些日子安装RWeka时专门研究过rJava的安装脚本, 卡在编译简单JNI程序这句一直不成功, 手动修改各种配置文件无果, 于是果断放弃 ...
继续阅读

Rapid Prototyping R based Web Applications with Rook: Visualizing CVE-2011-0611 samples with Self-Organizing Maps

Inspired by Ruby's Rack Project, Jeffery Horner released his R package "Rook" [1] earlier this year. After trying to get several Rook applications running, I realized that Rook had avoided some certain disadvantages of Rapache. Rook is much more flexible and easier to learn.

Theoretically speaking, once the proper plugin is done, your app could then be deployed under any web servers such as apache/lighthttpd/nginx, etc. Another significant advantage of Rook is, it's friendly for debugging. As Rook takes Rhttpd as the default server, you could preview your app on-the-fly, without any complicated deploying process.

Here's a test app, which implements the creative binary file visualization method described in the VizSec and Virol papers [2] and [3]. We choose to visualize the CVE-2011-0611 samples, which were retrieved from [4]. By using the Rook::File application simultaneously, we could serve static (png) files.

Load required pkgs:

?View Code RSPLUS
1
2
3
require(Rook)
require(digest)
require(kohonen)

Write a Rook app:

?Download visbin.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
newapp = function(env) {
    req = Rook::Request$new(env)
    res = Rook::Response$new()
    res$write('Choose a Binary file to Train:\n')
    res$write('<form method="POST" enctype="multipart/form-data">\n')
    res$write('<input type="file" name="data">\n')
    res$write('xdim:\n')
    res$write('<form method="POST">\n')
    res$write('<input type="text" name="xdim" value="12">\n')
    res$write('ydim:\n')
    res$write('<form method="POST">\n')
    res$write('<input type="text" name="ydim" value="25">\n')
    res$write('ncolors:\n')
    res$write('<form method="POST">\n')
    res$write('<input type="text" name="ncolors" value="8">\n')
    res$write('<input type="submit" name="Go!">\n</form>\n<br>')
 
    myNormalize = function (target) {
    return((target - min(target))/(max(target) - min(target)))
    }
 
    if (!is.null(req$POST())) {
    data = req$POST()[["data"]]
    hash = digest(data$tempfile, algo = "md5", file = TRUE)
    destFile = file(data$tempfile, "rb")
    k = floor((file.info(data$tempfile)$size/16)) - 2
    doneFile = readBin(con = destFile, what = "raw", n = 2 * 8 * k)
    close(destFile)
    tmpFile0 = rbind(doneFile[seq(1, (2 * 8 * k) - 1, 2)], doneFile[seq(2, (2 * 8 * k), 2)])
    tmpFile1 = paste(tmpFile0[1, ], tmpFile0[2, ], sep = "")
    initMat = matrix(strtoi(tmpFile1, 16L), ncol = 8, byrow = TRUE)
    normMat = myNormalize(initMat)
    trainedSOM = kohonen::som(normMat, grid = somgrid(xdim = req$POST()[["xdim"]], ydim = req$POST()[["ydim"]], "hexagonal"))
    png(paste("/tmp/", hash, ".png", sep = ""))
    plot(trainedSOM, type = "dist.neighbours", palette.name = rainbow, ncolors = as.numeric(req$POST()[["ncolors"]]), main = "")
    dev.off()
    res$write(paste("<img src='", s$full_url("pic"), "/", hash, ".png'", " />", sep = ""))
    }
    res$finish()
}

Initialize/Run the app:

?View Code RSPLUS
1
2
3
4
5
s = Rhttpd$new()
s$add(app = newapp, name = "visbin")
s$add(app = File$new("/tmp"), name = "pic")
s$start()
s$browse("visbin")

Firstly the app hashes the uploaded files then trains SOM models. As the training result differs each time, we may train more times to get the better one.

We use the U-Matrix to visualize the Self-Organizing Maps, The U-Matrix value of a particular unit is the average distance between the unit and its closest neighbors, then color was used to represent the value. Actually, the number of the color palette is critical, too much or too little may interfere the detection of potential cluster patterns.

There exists much more methods for dimensional reduction and visualization with R packages, you may refer to the R News (R Journal) paper [5].

visbin

It clearly shows that a cluster pattern appears in the lower right corner. It's reasonable to suspect the file was injected with some data that shouldn't be there.

The paper says it got bad results when visualizing macro viruses (embedded in Microsoft Office files). Actually, the CVE-2011-0611 sample are doc/xls files, but they are not macro viruses. They're hosts injected with harmful Adobe swf files. From this point of view, they're just like the infected executable files. So theory still applies.

A detail is, after uploading, the data$tempfile has a different MD5 with the original file, it gains extra hex 0D 0A (seems a new line) in the end. I don't quite understand how this happens. As we had deleted the last two lines of the file to form a proper matrix, the training data is not identical with the binary sample. Nothing influences for this case.

In summary, Rook connects the 3000+ available R package and web application development, just 40 lines of code were used to achieve a not-so-simple goal, it's really amazing.

References

[1] Rook - a web server interface for R.
[2] Visualizing Windows Executable Viruses Using Self-Organizing Maps, VizSec, 2004.
[3] Non-signature Based Virus Detection, Journal in Computer Virology, 2:163–186, 2006.
[4] Contagio Malware Dump. Apr. 8 CVE-2011-0611 Flash Player Zero day - SWF in DOC/ XLS - Disentangling Industrial Policy.
[5] Dimensional Reduction for Data Mapping, R News, Vol. 3/3, 2003.

结合豆瓣基础API学习XML包

很久以前在R-Forge上注册过一个RDouban项目, 想用豆瓣提供的API做点好玩的事情. 可惜后来只写了个开头, 感兴趣的童鞋可以无条件认领. 在这里结合豆瓣的基础API, 非常简略地写一下用XML包读数据的基本问题.

1 XPath

花十分钟学习XPath语法.
熟练后可使用Firebug等调试工具直接提取. 此外, 要特别注意XML命名空间问题. (感谢yixuan提醒)

2 Douban API

花n分钟阅读"豆瓣API参考手册".
用户的评论、收藏、广播、豆邮等交互功能往往需要先进行OAuth认证, 建议阅读RFC5849以充分理解OAuth协议. 这块目前也有ROAuth包可以实现, 不过与读数据没什么关系, 此略.

继续阅读

使用Google Perftools和kcachegrind深入剖析R程序性能瓶颈

“颐和园在我家北面,假如没有北这个方向的话,我就只好向南走,越过南极和北极,行程四万余公里到达那里。”

——《革命时期的爱情》

1 剖析与Profiling

这里的"剖析"对应的单词是"profiling", 中文似乎没有词语能精准地表达出原词的内涵, 与"shell"的情况有点类似, 所以就不去刻意翻译了. 由于没有学过软件工程, 不吝先贴一段wiki上profiling的定义[1]:

In software engineering, program profiling, software profiling or simply profiling, a form of dynamic program analysis (as opposed to static code analysis), is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

要提高R程序的运行速度, 不仅仅需要剽悍的机器(单核高频, 多核并行, GPU)和高效的代码(向量化, 混合编程), 寻找程序中的性能瓶颈并进行有针对的优化也是很重要的, "找到北这个方向", 就是profiling的意义所在.

R中其实自带了几个最简单的profiling工具, 我们或多或少都接触过:

  • base::system.time() —— 简陋的计时秒表
  • utils::Rprof() —— 对CPU的简易profile工具
  • utils::Rprofmem() —— 对内存的简易profile工具

关于这些函数的使用, 可以参看Ross Ihaka的说明[2]. 与此同时, CRAN上profr(Hadley Wickham)和proftools(Luke Tierney)两个微型包均提供了可视化Rprof()函数输出结果的能力. 但是, 这类简单的profiling只将程序拆解了到了单个R运算的层次, 没有提供更深一层, 即profiling compiled code的功能. R原生支持这个特性, 但需要在编译时对默认选项进行简单的修改[4].

2 安装Google Perftools

翻看Dirk Eddelbuettel牛在useR2010上的RhpcTutorial[3], 其中提到了Google员工开发的Google Perftools可以profiling compiled code. 试了一下, 感觉是个不错的工具, 配合kcachegrind, 还可以将结果可视化.

测试环境: Arch Linux x86_64

AUR上已有一位大人在维护google-perftools(源代码安装):

sudo yaourt -S google-perftools

Fedora:

sudo yum install google-perftools

Ubuntu:

sudo apt-get install google-perftools

不过Ubuntu/Fedora仓库中的二进制包可能比较陈旧了, 无妨直接自行编译最新版本:

svn checkout http://google-perftools.googlecode.com/svn/trunk/

在Arch x86_64下, google-perftools的默认安装路径为

/usr/bin

库文件libprofiler.so位于

/usr/lib

Google Perftools工具集中除了名为pprof的CPU profiler以外, 还提供了堆内存泄漏检测/使用情况统计等工具. 这里我们只关注pprof就可以了: 它通过CPU中断采样的方式统计每个函数被采样的次数, 占总采样次数的百分比, 调用的子函数的被采样次数等等(可以说"剖析"在此处还是比较恰当的). 最后通过这些信息寻找程序的(CPU)性能瓶颈.

3 使用Google Perftools

要和R一起使用, pprof有两种可行的运行方式, 第一种比较硬朗: 在编译选项中直接加入对libprofiler.so的引用, R在运行时就会自动加载libprofiler库:

wget http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz
tar -xf R-2.13.1.tar.gz
cd R-2.13.1
 
export MAIN_CFLAGS="-pg"
export MAIN_FFLAGS="-pg"
export MAIN_LDFLAGS="-pg"
export LDFLAGS="-lprofiler"
# 也可显式指定路径:
# export LDFLAGS="-L/usr/lib -lprofiler"
./configure --enable-R-shlib

参数 -pg 打开R的profiling支持, 设定LDFLAGS用以连接libprofiler库.

configure完成:

R is now configured for x86_64-unknown-linux-gnu

Source directory: .
Installation directory: /usr/local

C compiler: gcc -std=gnu99 -g -O2
Fortran 77 compiler: gfortran -g -O2

C++ compiler: g++ -g -O2
Fortran 90/95 compiler: gfortran -g -O2
Obj-C compiler:

Interfaces supported: X11
External libraries: readline, ICU, lzma
Additional capabilities: PNG, JPEG, TIFF, NLS, cairo
Options enabled: shared R library, shared BLAS, R profiling, Java

Recommended packages: yes

结果无误, 开始编译安装:

make && sudo make install

这样, 连接了libprofiler.so的R编译成功.

第二种使用pprof的方式则相对委婉: 使用时动态预加载库文件就可以了, Dirk大人的RhpcTutorial中已经给出说明, 此略.

选取《Wringting R Extension》 3.2节中给出的一个例子进行测试:

?Download profiling.R
#!/usr/local/lib64/R/bin/Rscript
suppressMessages(library(MASS))
suppressMessages(library(boot))
storm.fm <- nls(Time ~ b*Viscosity/(Wt - c), stormer, 
                start = c(b=29.401, c=2.2183))
st <- cbind(stormer, fit=fitted(storm.fm))
storm.bf <- function(rs, i) {
    st$Time <-  st$fit + rs[i]
    tmp <- nls(Time ~ (b * Viscosity)/(Wt - c), st, start = coef(storm.fm))
    tmp$m$getAllPars()
}
rs <- scale(resid(storm.fm), scale = FALSE)
storm.boot <- boot(rs, storm.bf, R = 500)

将profiling结果记录到文件:

chmod 755 profiling.R
CPUPROFILE=rprof.out ./profiling.R

这里在写R文件时用了一点点技巧, 使得它能够支持类Unix系统的Shebang特性而直接执行, 参考[7], [8].

执行完毕后, 我们即可使用pprof来分析输出的结果文件(一个二进制文件!)了. pprof可以将此文件解析成你想要的各种可读的形式. 其参数如下:

pprof --option [ --focus=< regexp > ] [ --ignore=< regexp > ]
[--line or addresses or functions] 可执行文件路径 结果文件路径

方括号为可选项目, < regexp >为正则表达式.

具体的选项分为几组. 其中输出格式的基本可选项为:

text, callgrind, gv, evince, web, symbols, dot, ps, pdf, svg, gif, raw, list=< regexp >, disasm=< regexp >.

text 表示字符统计输出形式, 其它均对应各自的图形格式;
list=< regexp > 表示输出匹配正则表达式的函数的源代码;
diasm=< regexp > 表示输出匹配正则表达式的函数的反汇编代码.

其他比较重要的参数:

--focus=< regexp > 表示只统计函数名匹配正则表达式的函数的采样;
--ignore=< regexp > 表示不统计函数名匹配正则表达式的函数的采样;
[--line or addresses or functions] 表示生成的统计是基于代码行, 指令地址还是函数的, 默认是函数.

这里仅输出文字型结果:

pprof --cum --text /usr/local/lib64/R/bin/Rscript rprof.out | less

结果中的前15位:

Total: 254 samples
2 0.8% 0.8% 213 83.9% Rf_applyClosure
24 9.4% 10.2% 213 83.9% Rf_eval
0 0.0% 10.2% 213 83.9% do_begin
0 0.0% 10.2% 212 83.5% do_set
0 0.0% 10.2% 206 81.1% do_internal
0 0.0% 10.2% 148 58.3% do_lapply
7 2.8% 13.0% 123 48.4% Rf_evalList
0 0.0% 13.0% 107 42.1% Rf_ReplIteration
0 0.0% 13.0% 104 40.9% R_ReplConsole
0 0.0% 13.0% 102 40.2% Rf_usemethod
0 0.0% 13.0% 100 39.4% run_Rmainloop
0 0.0% 13.0% 96 37.8% main
0 0.0% 13.0% 92 36.2% __libc_start_main
0 0.0% 13.0% 85 33.5% do_usemethod
1 0.4% 13.4% 85 33.5% forcePromise

输出结果中, 每行对应着一个函数的统计:

  • 第1, 2列是该函数的本地采样(不包括被该函数调用的函数中的采样次数)次数和比例;
  • 第3列是该函数本地采样次数占当前所有已统计函数的采样次数之和的比例;
  • 第4, 5列是该函数的累计采样次数(包括其调用的函数中的采样次数)和比例.

如果你的系统中安装了gnu-gv或evince, 即可直接即刻显示一幅无码清晰大图(ps/pdf):

pprof --gv /usr/local/lib64/R/bin/Rscript rprof.out
pprof --evince /usr/local/lib64/R/bin/Rscript rprof.out

rprof

其他几个比较常用的选项可能是

生成PDF:

pprof --pdf /usr/local/lib64/R/bin/Rscript rprof.out > rprof.pdf

生成SVG:

pprof --svg /usr/local/lib64/R/bin/Rscript rprof.out > rprof.svg

生成GraphViz所支持的dot格式:

pprof --dot /usr/local/lib64/R/bin/Rscript rprof.out > rprof.dot

当然, 要想读懂图中的内容, 从而针对某些部分进行优化, 还需要对R的底层比较熟悉才行: 最起码要了解涉及到的C函数的具体功能.

4 配合kcachegrind可视化profile结果

pprof可将输出转化为强大的Valgrind工具集中的组件Callgrind可采用的格式, 配合KCachegrind这个图形前端, 即可对结果进行简单的可视化, 能够交互哦亲:

# For Arch Linux
# sudo pacman -S kdesdk-kcachegrind
pprof --callgrind /usr/local/lib64/R/bin/Rscript rprof.out > rprof.callgrind
kcachegrind rprof.callgrind

kcachegrind

其实看上去KCachegrid就是做了一个最普通的树可视化, 所以理论上我们其实可以用角度各异的无数种手段展示profiling结果: 就是画一棵树嘛. 不过KCachegrind中可以与图形交互, 进一步的分析很方便, 大家可以自己进一步体验.

5 其他工具(sprof和oprofile)

Writing R Extensions[6]提到另外两个可供Linuxer选择的工具: sprof和oprofile, 我没有实验, 感兴趣的同学不妨实践一下.

6 参考

[1] Wikipedia - Profiling (computer programming).
[2] Ross Ihaka. Writing Efficient Programs in R (and Beyond).
[3] Dirk Eddelbuettel. Introduction to High-Performance Computing with R. pp. 29-35
[4] R Installation and Administration. Version 2.13.0 (2011-04-13) Appendix B.1, B.7.
[5] 冯文龙. 使用google-perftools剖析程序性能瓶颈.
[6] Writing R Extensions. Version 2.13.0 (2011-04-13) pp. 69-71.
[7] Wikipedia - Shebang.
[8] shebang line not working in R script.

给力的Rcpp和不给力的RcppArmadillo

在炎热与抑郁的夏天, 继续博客上的灌水大业才是正经事. xfocus水区有名曰:

灌水王潮

我一直幻想着有朝一日R能够同时拥有简单的语法和C的效率, 直到今天认真看了一下Rcpp, 不懂PROTECT UNPROTECT的孩子终于伤得起了 ... 不得不承认这发了JSS的软件包还是有一定独到之处的: 谁叫人家是SCI呢, 哈哈哈哈哈哈哈. 好吧自行吐槽其实我out了 ... 关于Rcpp, 不周山上早有此文.

Rcpp的作者在文档里说在Revolution R下没测试过可能需要修改. 我用icpc替代g++倒是能编译成功, 除了中间闪过几个warning.

安装之前先建立符号链接

sudo ln -s /opt/intel/composerxe-2011.3.174/bin/ia32/icpc /bin/icpc

否则会找不到icpc.

1 Rcpp + inline

如果不是写正式的包, Rcpp + inline让一切浮云都浮云了. 最简单的例子, 来自 [1]:

?View Code RSPLUS
require(Rcpp)
require(inline)
 
src = '
Rcpp::NumericVector xa(a);
Rcpp::NumericVector xb(b);
int n_xa = xa.size();
int n_xb = xb.size();
 
Rcpp::NumericVector xab(n_xa + n_xb - 1);
 
for (int i = 0; i < n_xa; i++)
  for (int j = 0; j < n_xb; j++)
    xab[i + j] += xa[i] * xb[j];
 
return xab;
'
 
fun = cxxfunction(
       signature(a = "numeric", b = "numeric"), 
       src, plugin = "Rcpp")
 
fun(1:3, 1:4)
# [1] 1 4 10 16 17 12

以上在Intel系编译器出来的R/Rcpp下通过.

2 Rcpp + RcppArmadillo + inline

这Armadillo是又一个线性代数库.

话说之前我一直以为这词是穿山甲的意思, 因为有个经典壳就叫Armadillo, 然后我就被Inception了. 今天一查这词的真正意思是"犰狳" ...

Armadillo在网站上说自己支持MKL和ACML. 但是用icpc编译的时候蹦了n+10086个warning, 编到最后竟然就

Error in dyn.load(file, DLLpath = DLLpath, ...) :
无法载入共享目标对象‘/home/jimmy/R/i686-pc-linux-gnu-library/2.13/RcppArmadillo/libs/RcppArmadillo.so’::
/home/jimmy/R/i686-pc-linux-gnu-library/2.13/RcppArmadillo/libs/RcppArmadillo.so: undefined symbol: _ZN4arma12arma_version5patchE

了! 编译都通不过的孩子, 你就伤得起么 ... 《编程序的小女孩》有云:

她刚把头伸出去,想看的仔细一些,程序crash了,大显示器不见了。她坐在那儿,眼前的破显示器上一行刺眼的segment fault。

此处太过曲折, 篇幅所限略去不表. 依然是最简单的例子, 来自 [2]:

?View Code RSPLUS
require(Rcpp)
require(RcppArmadillo)
require(inline)
 
src = '
arma::colvec x = Rcpp::as<arma::colvec> (x_);
arma::mat Y = Rcpp::as<arma::mat> (Y_ );
arma::colvec z = Rcpp::as<arma::colvec> (z_);
double result = arma::as_scalar(arma::trans(x) * arma::inv(Y) * z);
return Rcpp::wrap(result);
'
 
fx = cxxfunction(
      signature(x_ = "numeric", Y_ = "matrix", z_ = "numeric"),
      src, plugin = "RcppArmadillo")
 
fx(1:4, diag(4), 1:4)
# [1] 30

3 无题

给自己翻了个Rcpp-quickref [3], 没加颜色. 原文档里的高亮风格就不吐槽了, 同一段code snippet中serif mono各种混用难道很酷么 ...

Rcpp-quickref-cn.pdf

感觉作者挺不小心的, 括号前后的空格写得相当随意, 也许是多人共同蹂躏文档的结果 ... 有时间要研究一下RcppGSL.

4 Ref

[1] Rcpp: Seanless R and C++ Integration
[2] Frequently Asked Questions about Rcpp
[3] Rcpp Quick Reference Guide

5 无题又见无题

某天我突然冒出来一个十分惊悚的想法, 那就是去读物理, 而且方向必须是理论物理 ... 然后就被这白日梦吓醒了 ... 要知道我在大学里唯一学过一个学期的物理, 内容是刚体机械波热力学什么的, 去上过的课绝对没超过一半, 而且最后成绩是C ... 这 ... 难道是Crysis2 OST听多了么 ... 还是万有引力之虹看多了 ...

给我一万支秃笔, 让我仿黄鲁直, 削长枪大戟。
给我十千斗浊酒, 让我学李太白, 灭沧海横流。

霜刃未曾试

不试哪成?

?View Code RSPLUS
0
1
2
3
4
5
6
7
8
9
10
x = rep(NA, 10)
test1 <- function(m, n) {
  for (i in m:n) {
    x[i] <- i
    }
}
> print(x)
 [1] NA NA NA NA NA NA NA NA NA NA
> test1(3, 5)
> print(x)
 [1] NA NA NA NA NA NA NA NA NA NA
?View Code RSPLUS
0
1
2
3
4
5
6
7
test2 <- function(m, n) {
  for (i in m:n) {
    x[i] <<- i
    }
}
> test2(3, 5)
> print(x)
 [1] NA NA  3  4  5 NA NA NA NA NA