Impact Analysis of Training Data Characteristics
for Phishing Email Classification


Akash Sundararaj+ and Gökhan Kul
 

University of Massachusetts Dartmouth, Dartmouth, Massachusetts, United States of America
{asundararaj6, gkul}@umassd.edu

 

Abstract

E-mail is the most essential form of formal communication for organizations. However, phishing attacks delivered through e-mail are a prevalent threat, and they continue to rise steadily even though e-mail filters designed to prevent them have become ubiquitous. In this work, we examine the training data that phishing e-mail detectors use in order to identify the dataset parameters that optimize phishing e-mail classifiers. To perform this assessment, we surveyed phishing e-mail detection methods in the literature and found that the majority of phishing e-mail detectors use either structural properties or text mining methods. Therefore, we analyze the optimal ratio of phishing to legitimate e-mails in the training data for these approaches. We design an experiment using the Enron dataset and a phishing e-mail collection to evaluate the effectiveness of these methods with varying numbers of legitimate and phishing e-mails, and to show empirically their strengths and weaknesses for specific data parameters. We show how balanced and unbalanced e-mail datasets influence the results produced by the machine learning classifiers. Interestingly, unbalanced datasets provide better accuracy, while they consistently provide worse precision and recall than balanced datasets. The empirical results also suggest that phishing e-mail filters have not been perfected and that there is still room for development in this area. Our findings will help researchers avoid the common mistakes native to this type of threat before building machine learning classifiers for this domain.
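
The contrast between accuracy and precision on unbalanced data can be illustrated with a short arithmetic sketch. The detection rates and e-mail counts below are hypothetical values chosen for illustration and are not results from this work. The sketch models only the class mix of the evaluated e-mails, with phishing treated as the positive class, so recall stays fixed here, and it does not capture any effect of training on unbalanced data.

```python
def confusion_counts(n_phish, n_legit, tpr, fpr):
    """Expected confusion-matrix counts for a classifier with fixed rates.

    tpr: fraction of phishing e-mails flagged as phishing (hypothetical).
    fpr: fraction of legitimate e-mails flagged as phishing (hypothetical).
    """
    tp = round(n_phish * tpr)
    fn = n_phish - tp
    fp = round(n_legit * fpr)
    tn = n_legit - fp
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) with phishing as the positive class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical classifier: catches 80% of phishing, misflags 5% of legitimate mail.
TPR, FPR = 0.80, 0.05

# Balanced set: 1000 phishing and 1000 legitimate e-mails.
balanced = metrics(*confusion_counts(1000, 1000, TPR, FPR))

# Unbalanced set of the same total size: 100 phishing and 1900 legitimate e-mails.
unbalanced = metrics(*confusion_counts(100, 1900, TPR, FPR))

for name, (acc, prec, rec) in [("balanced", balanced), ("unbalanced", unbalanced)]:
    print(f"{name:>10}: accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f}")
```

Under these assumed rates, accuracy rises from 0.875 to 0.9425 on the unbalanced set because the majority class dominates the count of correct decisions, while precision falls from about 0.94 to about 0.46 because false positives from the large legitimate class outnumber the true positives.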

 

Keywords: Datasets, E-mail, Machine learning, Phishing detection

 

+: Corresponding author: Akash Sundararaj
Dion, Room 302,
285 Old Westport Road, Dartmouth, MA 02747, United States

 

Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications (JoWUA), Vol. 12, No. 2, pp. 85-98, June 2021

Received: January 16, 2021; Accepted: March 31, 2021; Published: June 30, 2021

DOI: 10.22667/JOWUA.2021.06.30.085