Performance of Distributed Text Processing System Using Hadoop

Taerim Lee, Hun Kim, Kyung Hyune Rhee, and Sang Uk Shin⁺

Pukyong National University, Busan, Republic of Korea
{taeri, mybreathing, khrhee, shinsu}@pknu.ac.kr

Abstract

Big Data brings new challenges to e-Discovery and digital forensics, most of them connected to how data is processed. Because time and cost largely determine the success or failure of a digital investigation, the first priority is a search method that finds relevant evidence in Big Data more quickly and accurately. This paper therefore introduces DTPS, a Distributed Text Processing System based on Hadoop, and explains how it differs from similar research in order to show why it is needed. In addition, it reports experimental results aimed at finding the best architecture and implementation strategy for using Hadoop MapReduce as a major part of a future e-Discovery cloud service.

Keywords: Electronic Discovery, e-Discovery, Digital Forensics, Evidence Search, Hadoop Performance, MapReduce Programming, Distributed Text Processing

⁺ Corresponding author: Sang Uk Shin, Computer Room 1314, Building 1, Department of IT Convergence and Application Engineering, Daeyeon Campus (608-737) 45, Yongso-ro, Nam-Gu, Busan, Republic of Korea, Tel: +82-(0)516296249

Journal of Internet Services and Information Security (JISIS), 4(1): 12-24, February 2014