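MapReduce is popular for good reasons: it is simple, easy to implement, and scales well. It makes it easy to write programs that run on many hosts at once. Map and Reduce programs can also be written in non-Java languages such as Ruby, Python, PHP, and C++, and the same program runs on any cluster with Hadoop installed, however many hosts that cluster has. MapReduce suits massive datasets because the data is split up and processed by many hosts in parallel, which usually finishes faster than processing it on a single machine.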
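Citation analysis is an important part of evaluating the quality of a paper. This example covers only its simplest piece: counting how many times each paper is cited. Suppose there are a great many papers (on the order of millions), and each paper's reference list looks like this: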
References
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
- Samuel Brody and Noemie Elhadad. 2010. An unsupervised aspect-sentiment model for online reviews. In NAACL '10.
- Jaime Carbonell and Jade Goldstein. 1998. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In SIGIR '98, pages 335–336.
- Dennis Chong and James N. Druckman. 2010. Identifying frames in political news. In Erik P. Bucy and R. Lance Holbert, editors, Sourcebook for Political Communication Research: Methods, Measures, and Analytical Techniques. Routledge.
- Cindy Chung and James W. Pennebaker. 2007. The psychological function of function words. Social Communication: Frontiers of Social Psychology, pages 343–359.
- Günes Erkan and Dragomir R. Radev. 2004. Lexrank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479.
- Stephan Greene and Philip Resnik. 2009. More than words: syntactic packaging and implicit sentiment. In NAACL '09, pages 503–511.
- Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In NAACL '09, pages 362–370.
- Sanda Harabagiu, Andrew Hickl, and Finley Lacatusu. 2006. Negation, contrast and contradiction in text processing.

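On a single machine, this counting task means first extracting the titles of all the papers into a hash table, then walking through every paper, reading its references, and incrementing the counts one by one. With so many papers, the data has to be swapped between memory and disk many times, which inevitably lengthens the running time. In MapReduce, by contrast, the whole task reduces to a WordCount.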
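
The WordCount framing can be sketched in a few lines of Python. This is only an in-process illustration of the map, shuffle, and reduce steps, not Hadoop code. It assumes each reference has already been joined onto one line in the layout "authors. year. title. venue", and the helper names (`extract_title`, `map_citations`, `shuffle`, `reduce_counts`, `count_citations`) are hypothetical. The title regex is a rough heuristic that would need hardening for real reference lists.

```python
import re
from collections import defaultdict

# Assumed entry layout: "<authors>. <year>. <title>. <venue ...>"
# The title is taken as the text between the year and the next sentence-ending period.
_TITLE_RE = re.compile(r"\b(?:19|20)\d{2}\.\s+(.+?)\.(?:\s|$)")


def extract_title(reference):
    """Return the lowercased title of one reference entry, or None if it does not parse."""
    match = _TITLE_RE.search(reference)
    return match.group(1).strip().lower() if match else None


def map_citations(reference_lines):
    """Map phase: emit (title, 1) for every reference whose title can be extracted."""
    for line in reference_lines:
        title = extract_title(line)
        if title:
            yield title, 1


def shuffle(pairs):
    """Group values by key; stands in for the shuffle step a framework would perform."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: sum the 1s emitted for each title."""
    return {title: sum(values) for title, values in grouped.items()}


def count_citations(papers):
    """papers: iterable of reference lists, one list of strings per citing paper."""
    pairs = (pair for refs in papers for pair in map_citations(refs))
    return reduce_counts(shuffle(pairs))
```

Here the paper title plays the role of the "word" in WordCount: each map call only looks at one paper's references, so the map work can be spread across hosts, and no single machine has to hold a hash table of every title.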



