<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
><channel><title>R2S</title> <atom:link href="http://www.road2stat.com/cn/feed" rel="self" type="application/rss+xml" /><link>http://www.road2stat.com/cn</link> <description>An Idle Wanderer of the Jianghu</description> <lastBuildDate>Tue, 10 Apr 2012 05:06:50 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.2</generator> <item><title>Feeds and Feedsky: An Escape</title><link>http://www.road2stat.com/cn/others/feedsky.html</link> <comments>http://www.road2stat.com/cn/others/feedsky.html#comments</comments> <pubDate>Sun, 08 Apr 2012 05:01:46 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[聚类失效]]></category> <category><![CDATA[30天退出服务]]></category> <category><![CDATA[feed]]></category> <category><![CDATA[FeedSky]]></category> <category><![CDATA[去中心化]]></category> <category><![CDATA[逃离]]></category> <category><![CDATA[重定向]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=1066</guid> <description><![CDATA[Feedsky has been half-dead for a long time, so today I used its "30-day exit service" to delete my hosted feed. In other words, readers who subscribed through the Feedsky feed will be redirected automatically to the original feed for one month. If you subscribed through Feedsky, I recommend switching to the original feed directly: if Feedsky goes down someday, the redirect goes with it. A couple of days ago I watched the American version of "The Girl with the Dragon Tattoo"; Rooney Mara did not disappoint. A web service meant to decentralize has itself become a new center.]]></description> <content:encoded><![CDATA[<p>Feedsky has been half-dead for a long time, so today I used its "30-day exit service" to delete my hosted feed. In other words, readers who subscribed through the Feedsky feed will be redirected automatically to the <a href="http://www.road2stat.com/cn/feed">original feed</a> for one month.</p><p>If you subscribed through Feedsky, I recommend switching to the original feed directly: if Feedsky goes down someday, the redirect goes with it.</p><p>A couple of days ago I watched the American version of "The Girl with the Dragon Tattoo"; Rooney Mara did not disappoint.</p><p>A web service meant to decentralize has itself become a new center.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/others/feedsky.html/feed</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Visualizing Long Time Series Data with lattice, ggplot2 and D3.js</title><link>http://www.road2stat.com/cn/statistics/tsplot.html</link> <comments>http://www.road2stat.com/cn/statistics/tsplot.html#comments</comments> <pubDate>Sat, 07 Apr 2012 20:05:49 
+0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[统计之路]]></category> <category><![CDATA[activity]]></category> <category><![CDATA[calendar plot]]></category> <category><![CDATA[D3.js]]></category> <category><![CDATA[ggplot2]]></category> <category><![CDATA[heatmap]]></category> <category><![CDATA[lattice]]></category> <category><![CDATA[online]]></category> <category><![CDATA[pattern]]></category> <category><![CDATA[social network service]]></category> <category><![CDATA[temporal data]]></category> <category><![CDATA[time series]]></category> <category><![CDATA[Visualization]]></category> <category><![CDATA[xyplot]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=1040</guid> <description><![CDATA[1 Introduction Personally, I have always wondered how other people use their social network accounts. For example, at what time of day do most people get online to post, comment and share? How many accounts are active late at &#8230; <a href="http://www.road2stat.com/cn/statistics/tsplot.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><a href="http://www.road2stat.com/cn/wp-content/attachments/2012/04/fb.jpg"><img src="http://www.road2stat.com/cn/wp-content/attachments/2012/04/fb.jpg" alt="Facebook" title="Facebook" width="460" height="272" class="aligncenter size-full wp-image-1052" /></a></p><h2>1 Introduction</h2><p>Personally, I have always wondered how other people use their social network accounts. For example, at what time of day do most people get online to post, comment and share? How many accounts are active late at night? And is the online pattern the same every day over a week, a month or a year?</p><p>Such questions keep popping into my head, so I scraped the number of active accounts at five-minute intervals for a week. The data source is renren.com (NASDAQ: RENN), which could be regarded as China's Facebook. 
The tools involved were a simple R script that did the actual scraping and a one-line cron rule that executed the script every five minutes.</p><p><span id="more-1040"></span></p><p>[ <a href="http://www.road2stat.com/cn/wp-content/attachments/2012/04/scrape.R">The R Script for Data Scraping</a> ]</p><p>Using the XML package is overkill for such a simple task, but it makes the script easier to extend if you want to scrape specific users or the complete list of online users. The cron rule is</p><pre>0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/lib/R/bin/Rscript /home/user/scrape.R</pre><p>which can be added to the task list with the <code>crontab -e</code> command.</p><p>[ <a href="http://www.road2stat.com/cn/wp-content/attachments/2012/04/record.csv">The Retrieved Data</a> ]</p><h2>2 lattice - Multipanel View</h2><p>It would be unrealistic to display the data as a single static line in one plot, because the sampling frequency of the time series is too high. One traditional solution is to cut the data into multiple pieces and display them separately in <strong>multiple panels</strong>. This can be achieved with R's lattice [1] package:</p><pre>
require(lattice)
rren = read.csv('record.csv', header = FALSE,
                col.names = c('Count', 'Time'))

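# Shift the Unix timestamps by 28800 s (8 h), presumably from UTC to UTC+8
rren$Time = rren$Time + 28800
rren$Time = as.POSIXct(rren$Time, origin = '1970-01-01')

# Cut the series into 7 non-overlapping, equal-sized panels (one per day)
xyplot(Count ~ Time | equal.count(as.numeric(Time), 7, overlap = 0),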
       data = rren, type = 'l', aspect = 'xy',
       strip = FALSE, xlab = '', ylab = '',
       scales = list(x = list(relation = "sliced", axs = "i"),
                     y = list(alternating = FALSE)))
</pre><div id="attachment_1041" class="wp-caption aligncenter" style="width: 470px"><a href="http://www.road2stat.com/cn/wp-content/attachments/2012/04/lattice_multipanel.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2012/04/lattice_multipanel.png" alt="Fig.1 Multipanel Display with lattice" title="Fig.1 Multipanel Display with lattice" width="460" height="560" class="size-full wp-image-1041" /></a><p class="wp-caption-text">Fig.1 Multipanel Display with lattice</p></div><p>From Figure 1 we can see that the overall trend is almost the same for each day, except that the morning rise in the lowest (Saturday) panel is flatter than in the other panels. But why? Maybe after a week's work, people tend to sleep in on Saturday morning.</p><p>One advantage of this method is its <strong>scalability</strong>: it can easily be applied to a much longer time series or to multivariate time-series data. The plot highlights the overall pattern of each individual day and makes it easy to see whether the overall trends differ between days. We also applied the 45-degree banking algorithm to use the space effectively, through lattice's built-in parameter <code>aspect = 'xy'</code>. However, it is somewhat difficult to compare the numbers of active users accurately across individual panels. As the recorded data grow, adding more panels would only result in a tedious plot.</p><h2>3 D3.js - Interactive View</h2><p>Adding some interaction capabilities to the visualization is another approach. 
With D3.js, we can achieve this [2] through its brush, pan and zoom support.</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2012/04/d3ts.html"><strong>Click here</strong></a> to see the interactive display (Google Chrome recommended).</p><div id="attachment_1042" class="wp-caption aligncenter" style="width: 510px"><a href="http://www.road2stat.com/cn/wp-content/attachments/2012/04/d3_interactive.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2012/04/d3_interactive.png" alt="Fig.2 Interactive Display with D3.js" title="Fig.2 Interactive Display with D3.js" width="500" height="392" class="size-full wp-image-1042" /></a><p class="wp-caption-text">Fig.2 Interactive Display with D3.js</p></div><p>In this interactive view, we can zoom in and out or move the slider along the x-axis to examine local patterns closely. Personally, though, I do not think it is a good idea to change the banking angle arbitrarily.</p><p>This method is somewhat more <strong>intuitive</strong>, and it is fun to explore the data directly in a web browser. The visualization could be improved by adding the ability to select multiple time intervals at the same time.</p><h2>4 ggplot2 - Heatmap View</h2><p>Time to change to a completely different view. We can map the numerical data to a color gradient and display it as a 2D heatmap [3]:</p><pre>
require(ggplot2)
rren = read.csv('record.csv', header = FALSE,
                col.names = c('Count', 'Time'))

# Drop the timestamps; the index vectors below assume one complete week of
# five-minute samples (7 x 24 x 12 = 2016 rows), starting on Sunday at 00:00
rren$Time = NULL
rren$Hour = rep(rep(0:23, each = 12), 7)
rren$Minute = rep(seq(0, 55, 5), 168)
# The numeric prefixes fix the order of the facets (Saturday first)
rren$Weekday = as.factor(rep(c('2 - Sun', '3 - Mon', '4 - Tue', '5 - Wed',
                               '6 - Thr', '7 - Fri', '1 - Sat'),
                             each = 288))

ggplot(rren, aes(Hour, Minute)) +
  geom_tile(aes(fill = Count), colour = "white") +
  scale_fill_gradient(low = "#F7FBFF", high = "#08306B") +
  facet_grid(Weekday ~ .)
</pre><div id="attachment_1043" class="wp-caption aligncenter" style="width: 500px"><a href="http://www.road2stat.com/cn/wp-content/attachments/2012/04/ggplot2_heatmap.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2012/04/ggplot2_heatmap.png" alt="Fig.3 Heatmap Display with ggplot2" title="Fig.3 Heatmap Display with ggplot2" width="490" height="540" class="size-full wp-image-1043" /></a><p class="wp-caption-text">Fig.3 Heatmap Display with ggplot2</p></div><p>Figure 3 shows more interesting patterns. On Monday morning there are more active users than on any other day of the week, probably because people need to cheer themselves up after two days' rest before refocusing on their jobs. In addition, the number of active users peaks between 20:00 and 22:00 every day, the golden time to <em>socialize</em>. More people also tend to stay up late on Saturday night.</p><p>An interesting piece of further reading is the <em>calendar plot</em> of such time-series data published in 2011 [4] (although those data are recorded daily and therefore cannot be laid out as a perfect rectangle). It visualized airline delays and cancellations in the United States over 21 years [5]. There is an R function [6] for drawing this plot. D3.js also has an implementation [7] suitable for interactive exploration.</p><h2>References</h2><p>[1] Deepayan Sarkar. (2008) Lattice: Multivariate Data Visualization with R. pp. 143-145.</p><p>[2] Mike Bostock. (2012) <a href="http://bl.ocks.org/1667367">Focus + Context (via Brushing)</a>.</p><p>[3] Learning R. (2010) <a href="http://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/">ggplot2: Quick Heatmap Plotting</a>.</p><p>[4] Rick Wicklin. (2011) Visualizing Airline Delays and Cancelations. Journal of Computational and Graphical Statistics. Volume 20, Issue 2, 284-286.</p><p>[5] Rick Wicklin, Robert Allison. 
(2009) <a href="http://stat-computing.org/dataexpo/2009/posters/wicklin-allison.pdf">Congestion in the Sky - Visualizing Domestic Airline Traffic with SAS</a>.</p><p>[6] Revolution Analytics. (2009) <a href="http://blog.revolutionanalytics.com/2009/11/charting-time-series-as-calendar-heat-maps-in-r.html">Charting time series as calendar heat maps in R</a>.</p><p>[7] Mike Bostock. (2011) <a href="http://boothead.github.com/d3/ex/calendar.html">D3.js Examples - Calendar View</a>.</p><p>[8] Andy McNeice. (2011) <a href="http://procrun.com/2011/11/11/what-5728-986-miles-look-like/">What 5,728.986 miles look like</a>.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/statistics/tsplot.html/feed</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>Linear and Circular Layouts for Network Visualization</title><link>http://www.road2stat.com/cn/statistics/network_visualization_layouts.html</link> <comments>http://www.road2stat.com/cn/statistics/network_visualization_layouts.html#comments</comments> <pubDate>Fri, 30 Mar 2012 11:12:12 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[统计之路]]></category> <category><![CDATA[circos]]></category> <category><![CDATA[circular]]></category> <category><![CDATA[hiveplot]]></category> <category><![CDATA[layout]]></category> <category><![CDATA[linear]]></category> <category><![CDATA[network]]></category> <category><![CDATA[Visualization]]></category> <category><![CDATA[可视化]]></category> <category><![CDATA[基因组]]></category> <category><![CDATA[环形布局]]></category> <category><![CDATA[线性布局]]></category> <category><![CDATA[网络数据]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=1032</guid> <description><![CDATA[Yesterday I gave a brief introduction to two layouts for network visualization at our seminar. The slides are here: Linear and Circular Layouts for Network Visualization [PDF, 9.5M] During the talk I somehow said that Davos was in India without even noticing; my brain must have short-circuited ... Was I thinking of Mahalanobis at the time?]]></description> <content:encoded><![CDATA[<p>Yesterday I gave a brief introduction to two layouts for network visualization at our seminar.</p><p>The slides are here:</p><p><a 
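href="http://www.road2stat.com/xiaonan/files/linear-and-circular-layouts-for-network-visualization-xiaonan.pdf">Linear and Circular Layouts for Network Visualization</a> [PDF, 9.5M]</p><p>During the talk I somehow said that Davos was in India without even noticing; my brain must have short-circuited ...</p><p>Was I thinking of Mahalanobis at the time?</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/statistics/network_visualization_layouts.html/feed</wfw:commentRss> <slash:comments>8</slash:comments> </item> <item><title>A Partial Preview of Chapter 1 of the Chinese Translation of "R in Action"</title><link>http://www.road2stat.com/cn/r_language/ria.html</link> <comments>http://www.road2stat.com/cn/r_language/ria.html#comments</comments> <pubDate>Thu, 15 Mar 2012 04:03:10 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[R]]></category> <category><![CDATA[R in Action]]></category> <category><![CDATA[R实战]]></category> <category><![CDATA[翻译]]></category> <category><![CDATA[试读]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=1026</guid> <description><![CDATA[Not here to sing the praises of technology, only to grumble in plain words. At the beginning of the year I posted a status update, a small vision for the year: "2012: work like Carmack, live like Liu Zhiyu." (Question: is the period in the right place?) Two months on, I find that my diligence is still far from Carmack's: the planned 16 working hours a day have in fact turned into time for daydreaming, sleeping, eating and boredom, while my "worldly possessions" really are almost gone ... First I took a New Oriental course over the winter break. All I remember is the teacher telling a series of Greek myths; vocabulary was covered by giving a few meanings and nothing more, so I had to look words up furiously in Youdao and beg Merriam-Webster for the context ... And so 1.4k RMB was gloriously squandered in 16 days. With three months left until graduation, I cannot help sighing at how time flies: even heaven and earth will one day come to an end, so for now let a cup of wine lift the spirits. Where, oh where is my graduation project? It is in the literature on CNKI ... If you are an IEEE Xplore/ACM Portal user, though, please ignore this ... One sentence to share with everyone heading to graduate school, a job, study abroad or a startup: only by staying true to your original aspiration will the offers come. End of rambling. The serious part follows: Recently I have been working with my senior Chen Gang and my friend Gao Tao to translate "R in Action", an introductory book on the R language. The original author is the creator of the Quick-R site, Robert I. 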
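Kabacoff, Ph.D. I have the honor of translating the first seven chapters, and here is an early preview published on the Turing Community: Turing Community: Reading: Why use R? Comments and corrections are welcome. If the original sin of writing is aimless wandering, then is the original sin of translating a purposeful facing of the wall?]]></description> <content:encoded><![CDATA[<p><a href="http://www.road2stat.com/cn/wp-content/attachments/2012/03/wanfengwudie.jpg"><img src="http://www.road2stat.com/cn/wp-content/attachments/2012/03/wanfengwudie.jpg" alt="Butterflies dancing in the evening breeze" title="Butterflies dancing in the evening breeze" width="500" height="373" class="aligncenter size-full wp-image-1027" /></a></p><p>Not here to sing the praises of technology, only to grumble in plain words.</p><p>At the beginning of the year I posted a status update, a small vision for the year: "2012: work like Carmack, live like Liu Zhiyu." (Question: is the period in the right place?)</p><p>Two months on, I find that my diligence is still far from Carmack's: the planned 16 working hours a day have in fact turned into time for daydreaming, sleeping, eating and boredom, while my "worldly possessions" really are almost gone ... First I took a New Oriental course over the winter break. All I remember is the teacher telling a series of Greek <del datetime="2012-03-15T03:49:42+00:00">fairy tales</del> myths; vocabulary was covered by giving a few meanings and nothing more, so I had to look words up furiously in Youdao and beg Merriam-Webster for the context ... And so 1.4k RMB was gloriously squandered in 16 days.</p><p>With three months left until graduation, I cannot help sighing at how time flies: even heaven and earth will one day come to an end, so for now let a cup of wine lift the spirits. Where, oh where is my graduation project? It is in the literature on CNKI ... If you are an IEEE Xplore/ACM Portal user, though, please ignore this ...</p><p>One sentence to share with everyone heading to graduate school, a job, study abroad or a startup: only by staying true to your original aspiration will the offers come.</p><p>End of rambling. The serious part follows:</p><p>Recently I have been working with my senior <a href="http://gossipcoder.com/">Chen Gang</a> and my friend <a href="http://www.gaotao.name/cn/">Gao Tao</a> to translate "R in Action", an introductory book on the R language. The original author is Robert I. Kabacoff, Ph.D., creator of the <a href="http://www.statmethods.net">Quick-R</a> site. I have the honor of translating the first seven chapters, and here is an early preview published on the Turing Community:<br /> <a href="http://www.ituring.com.cn/article/1207"><br /> Turing Community: Reading: Why use R?</a></p><p>Comments and corrections are welcome.</p><p>If the original sin of writing is aimless wandering, then is the original sin of translating a purposeful facing of the wall?</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/r_language/ria.html/feed</wfw:commentRss> <slash:comments>11</slash:comments> </item> <item><title>2011</title><link>http://www.road2stat.com/cn/life/2011.html</link> <comments>http://www.road2stat.com/cn/life/2011.html#comments</comments> <pubDate>Sat, 31 Dec 2011 13:47:55 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[生活点滴]]></category> <category><![CDATA[2011]]></category> <category><![CDATA[2012]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=1011</guid> <description><![CDATA[The memories of 2011 have never faded, just as the promises of 2011 have never changed tomorrow. 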
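In 2012, I hope to work more and grumble less and, on the principle of relying on nothing but being reliable, to stay reliable.]]></description> <content:encoded><![CDATA[<p>The memories of 2011 have never faded, just as the promises of 2011 have never changed tomorrow.</p><p>In 2012, I hope to work more and grumble less and, on the principle of relying on nothing but being reliable, to stay reliable.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/life/2011.html/feed</wfw:commentRss> <slash:comments>7</slash:comments> </item> <item><title>A Conjecture on Douban's Rating Calculation Strategy</title><link>http://www.road2stat.com/cn/statistics/douban_rank.html</link> <comments>http://www.road2stat.com/cn/statistics/douban_rank.html#comments</comments> <pubDate>Sat, 31 Dec 2011 12:48:49 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[统计之路]]></category> <category><![CDATA[douban]]></category> <category><![CDATA[IMDB]]></category> <category><![CDATA[quantreg]]></category> <category><![CDATA[XML]]></category> <category><![CDATA[公式]]></category> <category><![CDATA[分位回归]]></category> <category><![CDATA[参数]]></category> <category><![CDATA[排序]]></category> <category><![CDATA[测度]]></category> <category><![CDATA[计算]]></category> <category><![CDATA[评分]]></category> <category><![CDATA[豆瓣]]></category> <category><![CDATA[豆瓣电影250]]></category> <category><![CDATA[距离]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=985</guid> <description><![CDATA[1 Introduction In a short post from September [1], we got a brief look at one aspect of Douban movie ratings. In fact, we are also quite interested in the rating rules themselves. Here we take Douban Movie as an example for a simple conjecture and analysis; the same applies to music and books. The word "strategy" in the title is used in contrast to "mechanism" and refers to something fairly concrete. 2 Single Items Some people say that the rating of a single item is just a simple weighted average over the share of raters at each star level. Since the rating shown on the page is out of 10 while users rate on a 5-star scale, each star is worth 2 points, so the rating of a single item is computed as: rating = (10 x share of 5-star) + (8 x share of 4-star) + ... + (2 x share of 1-star) Checking this hypothesis by hand on a sample of items shows that it indeed holds. 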
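However, there is a pitfall here: because of the peculiarities of rating data and the limits of sampling, &#8230; <a href="http://www.road2stat.com/cn/statistics/douban_rank.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/simpsons_movie.jpg"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/simpsons_movie.jpg" alt="simpsons_movie" title="simpsons_movie" width="500" height="325" class="aligncenter size-full wp-image-986" /></a></p><h2>1 Introduction</h2><p>In a short post from September [1], we got a brief look at one aspect of Douban movie ratings. In fact, we are also quite interested in the rating rules themselves. Here we take Douban Movie as an example for a simple conjecture and analysis; the same applies to music and books. The word "strategy" in the title is used in contrast to "mechanism" and refers to something fairly concrete.</p><h2>2 Single Items</h2><p>Some people say that the rating of a single item is just a simple weighted average over the share of raters at each star level. Since the rating shown on the page is out of 10 while users rate on a 5-star scale, each star is worth 2 points, so the rating of a single item is computed as:</p><p><code>rating = (10 x share of 5-star) + (8 x share of 4-star) + ... + (2 x share of 1-star)</code></p><p>Checking this hypothesis by hand on a sample of items shows that it indeed holds.</p><p>However, there is a pitfall here: because of the peculiarities of rating data and the limits of sampling, if we run a regression on a subset of the data, the result may be affected by the sample and drift away from the hand-checked result. Since 1-star (very poor) and 2-star (poor) ratings usually make up a very small share in most items, ordinary regression easily tends to shrink the coefficients of X1, X2 and X3. For example, take the first 400 of the 443 movies <a href="http://movie.douban.com/people/road2stat/collect?sort=time&#038;mode=list" target="_blank">I have watched</a> as a sample <a href='http://www.road2stat.com/cn/wp-content/attachments/2011/12/rateSample.csv'>[rateSample.csv]</a> and run a regression.</p><p><span id="more-985"></span></p><p>Regression results:<br /> <code>Call:<br /> lm(formula = Y ~ . - 1, data = rateSample)</p><p>Residuals:<br /> Min        1Q    Median        3Q       Max<br /> -0.248949 -0.036126  0.001084  0.042101  0.217110</p><p>Coefficients:<br /> Estimate Std. Error t value  Pr(>|t|)<br /> X5  9.97216    0.01884  529.436 <2e-16 ***<br /> X4  7.98297    0.03287  242.901 <2e-16 ***<br /> X3  5.64024    0.05587  100.962 <2e-16 ***<br /> X2  4.41272    0.29841   14.787 <2e-16 ***<br /> X1  1.38666    0.64675    2.144  0.0326 *<br /> ---<br /> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 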
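0.1 ' ' 1</p><p>Residual standard error: 0.06798 on 395 degrees of freedom<br /> Multiple R-squared: 0.9999,	Adjusted R-squared: 0.9999<br /> F-statistic: 1.053e+06 on 5 and 395 DF,  p-value: < 2.2e-16</code></p><p>Because the rating shares here are rounded, the five shares of an item do not necessarily sum to exactly 1. A plot shows that nearly all sums fall within [0.999, 1.001]; there are 4 samples whose sums are 0.998 or 1.002, which does not affect the result.</p><p>The regression results show that this peculiarity of the sample does affect the X3, X2 and X1 terms: although the tests are significant, the estimates are far from the true values 6, 4 and 2.</p><p>A digression: Box (yes, the Box you know) once said, <em>Statisticians, like artists, have the bad habit of falling in love with their models</em>.</p><p>I do not know much about falling in love with fashion models, but really, do not fall in love with statistical models. As for why, extending the famous quote makes it clear:</p><p><em>Models always come with their assumptions</em>.</p><p>End of digression. Here are the rating data of these 400 movies in a parallel coordinates plot:</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/rate_para_coord.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/rate_para_coord.png" alt="rate_para_coord" title="rate_para_coord" width="500" height="309" class="aligncenter size-full wp-image-988" /></a></p><p>The plot shows the overall rating profile of most samples: 1- and 2-star ratings are far fewer than 3-, 4- and 5-star ratings.</p><p>I suspect two possible causes:</p><ul><li>Douban's movie recommendations steer users toward films that others rate highly, so the lower a film's rating, the fewer viewers it gets, which pulls the overall ratings up;</li><li>Users may neglect to rate mediocre films while tending to rate the ones they like. This effectively leaves some low ratings missing, especially since a 5-point scale (unlike IMDB's 10-point scale) discriminates between films rather coarsely.</li></ul><h2>3 The Douban Movie Top 250 List</h2><p>Both Douban Movie and IMDB have a Top 250 list. How the Douban Movie 250 is computed has been discussed before [2]. An interesting question: assuming Douban really uses the IMDB formula [3] to compute this list, can the two parameters of the formula be recovered from the data?</p><p>This is in fact a regression problem whose response variable is a ranking, which differs both from a traditional pure regression problem and from a classical learning-to-rank problem. Indeed, it makes considerable demands on both sides:</p><ol><li>Compared with traditional regression, we want the regression coefficients, but the target variable is a ranking;</li><li>Compared with classical learning to rank, the target variable is a ranking, but we need explicit coefficients for the regression equation;</li><li>If the problem is recast as a traditional classification problem, a good deal of information is lost.</li></ol><p>After digging around, I found a KDD '10 paper in which D. Sculley of Google proposed a method [4] for this kind of problem (with a ready-made tool [5], which someone has ported to an R package [6]). However, our problem is nonlinear, so the method cannot be applied directly.</p><p>But do not lose faith in life. Because the dimension is low, one approach is still feasible: estimate rough intervals for the parameters, then, for some measure of ranking accuracy (here simply the city-block distance and the Euclidean distance), search the space spanned by these intervals by brute force, and finally take the point or set of points that gives the most accurate ranking. 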
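Following this idea, and <strong>assuming this formula is used</strong>, the parameter C at the time the data were collected (2011/12/26) is estimated to be about [6.0, 6.1], and the parameter m roughly within [2900, 3100].</p><p>Because the Douban Movie 250 list is not updated in real time, whereas the rating counts and scores shown in it are, and because the search grid has limited density, the estimates are bound to contain error. We conjecture that this affects films in the second half of the list more strongly, while films near the top stay relatively stable. A quantile regression [7] of the offset between each item's computed position and its true rank gives:</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/rank_shift_quantreg.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/rank_shift_quantreg.png" alt="rank_shift_quantreg" title="rank_shift_quantreg" width="500" height="470" class="aligncenter size-full wp-image-989" /></a></p><p>The red dashed line is the least-squares estimate, the gray solid lines are the quantile regression estimates at the 0.1, 0.25, 0.75 and 0.9 quantiles, and the black solid line is the 0.5 quantile.</p><p>Reading the plot:</p><ul><li>The offsets in the later part of the list are slightly larger than in the first half;</li><li>How much the offset depends on list position differs slightly across quantiles.</li></ul><p>Interested readers can also track how the results change when the list is updated.</p><h2>4 Conclusion</h2><p>Useful takeaways:</p><ol><li>The rating of a single item is simply a weighted average of the 5 scores. This leaves room for refinement: without per-user rating weights, it is vulnerable to malicious rating;</li><li>The Douban 250 list may borrow the IMDB formula, possibly with modifications. The formula itself has drawn some criticism [8] and could perhaps be corrected;</li><li><strong>Assuming this formula is used</strong>, the parameters can be estimated on the fly from changes in the Douban Movie 250 list, revealing the (configured) mean score over all movies and the minimum number of ratings required for the list at that time. With these two parameters and the information in the current list, while the list update lags we can work out in advance the minimum score an item with a given number of ratings (>m) needs to reach a given position, or, for a fixed score, the minimum number of ratings needed to make the list.</li></ol><p>Open issues:</p><ol><li>If the original formula is not actually used, the estimates above are almost meaningless;</li><li>If the list is subject to manual intervention, for example counting only ratings from users who rate regularly, the estimates will be affected [8];</li><li>The measure of ranking accuracy is open to debate;</li><li>Although this approach yields the global minimum, the global minimum is not necessarily the true parameters, which may hide among the other small values.</li></ol><p>Finally, the code tells the truth. <a href='http://www.road2stat.com/cn/wp-content/attachments/2011/12/dbrank.R'>[dbrank.R]</a></p><h2>References</h2><p>[1] R2S. <a href="http://www.road2stat.com/cn/statistics/douban_rating.html" title="豆瓣用户对不同类型影片的打分是否真的有倾向性?" target="_blank">豆瓣用户对不同类型影片的打分是否真的有倾向性?</a></p><p>[2] 麻油四. <a href="http://www.douban.com/group/topic/2426734/" target="_blank">豆瓣250算法浅析</a>.</p><p>[3] Wikipedia. <a href="http://en.wikipedia.org/wiki/Internet_Movie_Database" target="_blank">Internet Movie Database</a>.</p><p>[4] D. Sculley. Combined Regression and Ranking. 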
Proceedings of the 16th Annual SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.</p><p>[5] D. Sculley. <a href="http://code.google.com/p/sofia-ml/" target="_blank">sofia-ml</a> - Suite of Fast Incremental Algorithms for Machine Learning.</p><p>[6] Michael King and Fernando Cela Diaz. (2011). <a href="http://CRAN.R-project.org/package=RSofia" target="_blank">RSofia</a>: Port of sofia-ml to R.</p><p>[7] Roger Koenker (2011). <a href="http://CRAN.R-project.org/package=quantreg" target="_blank">quantreg</a>: Quantile Regression.</p><p>[8] <a href="http://www.azillionmonkeys.com/qed/imdbfix.shtml" target="_blank">Corrected IMDb Movie Rankings</a>.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/statistics/douban_rank.html/feed</wfw:commentRss> <slash:comments>5</slash:comments> </item> <item><title>Hiragino Sans GB vs STXihei: An Overlay Comparison</title><link>http://www.road2stat.com/cn/imaging/hiragino_vs_xihei.html</link> <comments>http://www.road2stat.com/cn/imaging/hiragino_vs_xihei.html#comments</comments> <pubDate>Thu, 22 Dec 2011 13:05:43 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[光影之魅]]></category> <category><![CDATA[Hiragino Sans GB]]></category> <category><![CDATA[STXihei]]></category> <category><![CDATA[冬青黑体]]></category> <category><![CDATA[华文细黑]]></category> <category><![CDATA[字体]]></category> <category><![CDATA[对比]]></category> <category><![CDATA[苹果]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=973</guid> <description><![CDATA[Harbin, the ice city of the north, has been unusually warm this winter, reminding us once again that only one full year remains before 2012 arrives; those who still have not got a boat ticket had better hurry. Today let us overlay and compare Apple's new and old primary Chinese typefaces: Hiragino Sans GB W3 (冬青黑体) and STXihei (华文细黑). Hiragino Sans GB = red, STXihei = blue. Brief summary: At the same point size, Hiragino Sans GB indeed has a larger face than STXihei, which may help on-screen display; its stroke terminals are treated much less dramatically than STXihei's and look far plainer; in Hiragino Sans GB the final stroke of the slanted hook is clearly longer than in STXihei, while the lower-right component is compressed, so the glyphs are well balanced overall and the ink is distributed more evenly. References [1] Type is Beautiful. 雪豹新简体字体 Hiragino Sans GB. [2] 林泉约. 混乱的国标，不统一的“走”. [3] Wikipedia. Hiragino. 
&#8230; <a href="http://www.road2stat.com/cn/imaging/hiragino_vs_xihei.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p>Harbin, the ice city of the north, has been unusually warm this winter, reminding us once again that only one full year remains before 2012 arrives; those who still have not got a boat ticket had better hurry. Today let us overlay and compare Apple's new and old primary Chinese typefaces: Hiragino Sans GB W3 (冬青黑体) and STXihei (华文细黑).</p><p>Hiragino Sans GB = red, STXihei = blue.</p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/vs_chs.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/vs_chs.png" alt="" title="hiragino_vs_xihei_chs" width="500" height="810" class="aligncenter size-full wp-image-974" /></a></p><p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/vs_cht.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/vs_cht.png" alt="" title="hiragino_vs_xihei_cht" width="500" height="810" class="aligncenter size-full wp-image-975" /></a></p><p>Brief summary:</p><ol><li>At the same point size, Hiragino Sans GB indeed has a larger face than STXihei, which may help on-screen display;</li><li>Its stroke terminals are treated much less dramatically than STXihei's and look far plainer;</li><li>In Hiragino Sans GB the final stroke of the slanted hook is clearly longer than in STXihei, while the lower-right component is compressed, so the glyphs are well balanced overall and the ink is distributed more evenly.</li></ol><h2>References</h2><p>[1] Type is Beautiful. <a href="http://www.typeisbeautiful.com/2010/01/1894" target="_blank">雪豹新简体字体 Hiragino Sans GB</a>.</p><p>[2] 林泉约. <a href="http://lethean.me/archives/299" target="_blank">混乱的国标，不统一的“走”</a>.</p><p>[3] Wikipedia. <a href="http://zh.wikipedia.org/wiki/Hiragino" target="_blank">Hiragino</a>.</p><p>[4] Lukhnos D. Liu. <a href="http://blog.lukhnos.org/post/195916082/hiragino-sans-gb-a-typeface-with-japanese-soul-and" target="_blank">Hiragino Sans GB: A typeface with Japanese soul and Simplified Chinese look</a>.</p><p>[5] 齐立. <a href="http://www.foundertype.com/index/stylist/ql.html" target="_blank">微软雅黑的设计</a>.</p><p>[6] 李少波. 黑体字研究: [博士学位论文]. 
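北京: 中央美术学院, 2008.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/imaging/hiragino_vs_xihei.html/feed</wfw:commentRss> <slash:comments>7</slash:comments> </item> <item><title>OpenScholar Is a Good Project</title><link>http://www.road2stat.com/cn/life/openscholar.html</link> <comments>http://www.road2stat.com/cn/life/openscholar.html#comments</comments> <pubDate>Sat, 10 Dec 2011 17:58:31 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[生活点滴]]></category> <category><![CDATA[CMS]]></category> <category><![CDATA[Drupal]]></category> <category><![CDATA[IQSS]]></category> <category><![CDATA[OpenScholar]]></category> <category><![CDATA[内容管理系统]]></category> <category><![CDATA[学术]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=966</guid> <description><![CDATA[After an epic stretch of wine and feasting, a glorious two months went by without an update, until I took an arrow in the knee. Two days ago I discovered the OpenScholar project, put together by a few folks at IQSS. It aims to give research units such as departments, institutes and labs a platform for quickly building large numbers of personal and group sites. It is built on Drupal and ships with modules such as biblio, and a quick Google search shows that it does have some university users. The downside is that global configuration is painful and tedious, and using it for a single site is a bit of a luxury. Still, I like its bundled themes very much, so I promptly scrapped my unbearably ugly static homepage, tidied up the long-neglected pages a little, and sighed: "A content management system is the fate no site builder can ever escape." Even a schoolchild's essay should be finished promptly; I left two drafts lying around in October and have already forgotten them completely.]]></description> <content:encoded><![CDATA[<p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/12/openscholar.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/12/openscholar.png" alt="openscholar" title="openscholar" width="450" height="140" class="aligncenter size-full wp-image-968" /></a></p><p>After an epic stretch of wine and feasting, a glorious two months went by without an update, until I took an arrow in the knee.</p><p>Two days ago I discovered the <a href="http://openscholar.harvard.edu/" target="_blank">OpenScholar</a> project, put together by a few folks at <a href="http://www.iq.harvard.edu/" target="_blank">IQSS</a>. It aims to give research units such as departments, institutes and labs a platform for quickly building large numbers of personal and group sites. It is built on Drupal and ships with modules such as biblio, and a quick Google search shows that it does have some university users. The downside is that global configuration is painful and tedious, and using it for a single site is a bit of a luxury. Still, I like its bundled themes very much, so I promptly scrapped my unbearably ugly static <a href="http://www.road2stat.com/" target="_blank">homepage</a>, tidied up the long-neglected pages a little, and sighed: "A content management system is the fate no site builder can ever escape."</p><p>Even a schoolchild's essay should be finished promptly; I left two drafts lying around in October and have already forgotten them completely.</p> ]]></content:encoded> 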
<wfw:commentRss>http://www.road2stat.com/cn/life/openscholar.html/feed</wfw:commentRss> <slash:comments>2</slash:comments> </item> <item><title>Ten Typical Symptoms of Potential Academic Paranoia</title><link>http://www.road2stat.com/cn/statistics/academic_paranoia.html</link> <comments>http://www.road2stat.com/cn/statistics/academic_paranoia.html#comments</comments> <pubDate>Tue, 11 Oct 2011 16:42:26 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[统计之路]]></category> <category><![CDATA[academic]]></category> <category><![CDATA[Lisa Simpson]]></category> <category><![CDATA[paranoia]]></category> <category><![CDATA[symptoms]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=956</guid> <description><![CDATA[Is used to writing articles that begin with a section named 'Introduction' or end with a section named 'Conclusions'. Always cites several references in essays of any type or length; strongly believes that without citations, the work will not be recognized by &#8230; <a href="http://www.road2stat.com/cn/statistics/academic_paranoia.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p><a href="http://www.road2stat.com/cn/wp-content/attachments/2011/10/Frink.png"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/10/Frink.png" alt="Prof.Frink" title="Prof.Frink" width="480" height="445" class="aligncenter size-full wp-image-958" /></a></p><ol><li>Is used to writing articles that begin with a section named 'Introduction' or end with a section named 'Conclusions'.</li><li>Always cites several references in essays of any type or length; strongly believes that without citations, the work will not be recognized by anybody.</li><li>Hates magazines with huge pictures and imprecise text; has a special fondness for two-column, small-font, tightly set dissertations with formulas, three-line tables, and stylish scalable graphics made of dots and lines.</li><li>Uses a 
reference manager, instead of regular tools such as Google Calendar, to organize daily life.</li><li>Has blogged on academic topics constantly for 2.5+ years, or has set up a stand-alone blog about current research.</li><li>Has talked about academic matters in 50%+ of Twitter/Facebook statuses over the last 2 years, or has social accounts used purely for academic purposes.</li><li>Has had at least one horrible nightmare about a B+ ruining perfect straight As, just like Lisa Simpson did.</li><li>On encountering data from the middle of nowhere, always wonders what its underlying patterns look like and imagines, very seriously, constructing a quantitative model for it.</li><li>On seeing a problem, cannot help diving into scholarly databases to retrieve related papers, reading the references thoroughly and digging recursively; gigabytes of papers eventually end up stored on the hard drive.</li><li>Blogs about academic paranoia and doesn't notice anything, until now.</li></ol> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/statistics/academic_paranoia.html/feed</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>Connecting R to PostgreSQL</title><link>http://www.road2stat.com/cn/r_language/rpostgresql.html</link> <comments>http://www.road2stat.com/cn/r_language/rpostgresql.html#comments</comments> <pubDate>Mon, 03 Oct 2011 19:30:46 +0000</pubDate> <dc:creator>Xiao Nan</dc:creator> <category><![CDATA[R]]></category> <category><![CDATA[DBI]]></category> <category><![CDATA[PostgreSQL]]></category> <category><![CDATA[数据库]]></category><guid isPermaLink="false">http://www.road2stat.com/cn/?p=949</guid> <description><![CDATA[Lately I have been playing Mirror's Edge, DICE's masterpiece from three years ago, and revisiting Mafia II from a year ago, to the point that I have had hardly any time to post here. Between games I happened to come across a PostgreSQL database, so here are some brief notes. There are several ways to connect R to a database, essentially DBI, ODBC and JDBC. Frankly, though, ODBC and JDBC are rather weak. rJava, the ghost-like dependency of the JDBC route, is really hard to install. It is also possible that the JDK package on AUR is poorly packaged and does not meet what R CMD javareconf expects. When installing RWeka a while ago I studied rJava's install script closely: it kept failing at the step that compiles a simple JNI program, manually editing various configuration files did not help, so I gave up ... Rant over; back to the topic. PostgreSQL comes from Berkeley, is released under a license named after itself (how bold), and has many fine features. 
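I always want to read the name as Post·GRE·SQL, which translates easily to "the structured query language of the old GRE"; but how are the gentlemen out in rural New Jersey, so devoted to revamping the GRE, supposed to feel about that name ... Either way, of course, you end up with some ten thousand English words barging in. Meanwhile, the RPostgreSQL package we chose is a GSoC '08 project with contributions from Dirk Eddelbuettel and other experts in the R community, a proper member of the DBI family. In addition, users from the Bioconductor project have contributed the RdbiPgSQL/pgUtils packages, also DBI-based. Use them as appropriate. &#8230; <a href="http://www.road2stat.com/cn/r_language/rpostgresql.html">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description> <content:encoded><![CDATA[<p>Lately I have been playing Mirror's Edge, DICE's masterpiece from three years ago, and revisiting last year's Mafia II along the way, which has left me almost no time to post here. In between games I happened to run into a PostgreSQL database, so here are some brief notes.<br /> <a href="http://www.road2stat.com/cn/wp-content/attachments/2011/10/mirrorsedge.jpg"><img src="http://www.road2stat.com/cn/wp-content/attachments/2011/10/mirrorsedge.jpg" alt="mirrorsedge" title="mirrorsedge" width="500" height="336" class="aligncenter size-full wp-image-950" /></a></p><p>There are a few ways to connect R to a database, essentially DBI, ODBC and JDBC. Frankly, though, ODBC and JDBC are rather weak. rJava, the ghost-like dependency of the JDBC route, is really hard to install. It is also possible that the JDK package on AUR is poorly built and cannot satisfy what R CMD javareconf expects. Some time ago, while installing RWeka, I studied rJava's install script; it kept failing at the step that compiles a simple JNI program, hand-editing various config files did not help, so I gave up without regret ...<br /> <span id="more-949"></span><br /> Rant over, back to the topic. PostgreSQL comes from Berkeley and is released under a license named after itself (bold and showy), and it has <a href="http://obmem.info/?p=493" target="_blank">many fine features</a>. I always want to read the name as Post·GRE·SQL, which translates easily to "the structured query language of the old GRE"; but how are the gentlemen out in rural New Jersey, so devoted to revamping the GRE, supposed to feel about that name ... Either way, of course, you end up with some ten thousand English words barging in. Meanwhile, the RPostgreSQL package we chose is a GSoC '08 project with contributions from Dirk Eddelbuettel and other experts in the R community, a proper member of the DBI family.</p><p>In addition, users from the Bioconductor project have contributed the RdbiPgSQL/pgUtils packages, also DBI-based. Use them as appropriate.</p><pre># For Arch Linux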
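# Install the PostgreSQL server
$ sudo pacman -S postgresql
# Start the daemon
$ sudo /etc/rc.d/postgresql start
# Create a user (-s makes it a superuser role)
$ sudo createuser -s -U postgres
# psql is a handy tool; -l lists the available databases
$ psql -l
# Create a database
$ createdb newdatabase
# Import a pg_dump file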
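$ psql -d newdatabase -U postgres -f dump.sql</pre><p>The import is reasonably fast: a pg_dump file of over 700 MB loads in just a few minutes, and disk usage grows to 2 GB+. Usage questions for PostgreSQL are covered in detail on the <a href="https://wiki.archlinux.org/index.php/PostgreSQL" target="_blank">Arch Wiki</a>. For day-to-day administration, I strongly recommend pgAdmin, a free, cross-platform GUI tool that is also released under the PostgreSQL license:</p><pre>sudo pacman -S pgadmin3</pre><p>The R part is very easy:</p><pre>require(RPostgreSQL)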
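# Load the driver
drv = dbDriver("PostgreSQL")
# Fill in the connection details (the quoted values are placeholders)
con = dbConnect(drv, dbname = "database_name",
user = "username", password = "password", port = 5432)
# Send a query
rs = dbSendQuery(con, statement = "SQL statement")
# Fetch the results (n = -1 retrieves all rows)
df = fetch(rs, n = -1)
# Clear the result set once it has been fetched
dbClearResult(rs)
# Alternatively, run the query and return the result in one step
dbGetQuery(con, "SQL statement")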
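# A hedged sketch of writing data back, not part of the original post:
# "mytable" is a hypothetical table name, and df is the data frame fetched above.
# dbWriteTable() creates the table from the data frame's columns.
dbWriteTable(con, "mytable", df, row.names = FALSE)
# Confirm the table exists, then read it back to inspect column types and encoding
dbExistsTable(con, "mytable")
str(dbReadTable(con, "mytable"))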
# Disconnect
dbDisconnect(con)
# Release resources
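dbUnloadDriver(drv)</pre><p>When writing data, you may run into luck-of-the-draw problems with data types, character encodings and the like. For more details, consult the documentation; experts, it seems, often cannot be bothered to write a vignette. Users with nothing but the function reference manual to read really do have it rough.</p> ]]></content:encoded> <wfw:commentRss>http://www.road2stat.com/cn/r_language/rpostgresql.html/feed</wfw:commentRss> <slash:comments>6</slash:comments> </item> </channel> </rss>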
