Applying adversarial learning to invade a spam filter


With the development of the Internet and computer sciences, machine learning is being applied to cyber security. Because of its great performances and efficiency, it has become a vital means to combat attacks. Nevertheless, both the attackers and defenders have noticed the implicit insecurity of machine learning. 

This project will apply the concepts of adversarial learning to invade a spam filter. The key problem is to explore how to invade a spam filter so that the attacker can have his malicious emails labeled benign. It is expected that without defensive measures, the Naive Bayes spam filter can be successfully attacked by employing a scenario-specific approach. The spam filter and the attack are realized by Python 3.5.4 (64 bit). Source code and training sets are uploaded to github ( 

By demonstrating the invasion of a spam filter, the project can serve to emphasize the importance of the security of machine learning itself in its application in cyber security. We also concluded the research with some security advice, which can benefit machine learning algorithm designers in improving the security of their models. 

Through experiments, it can be concluded that undefended spam filters are vulnerable to deliberate invasion attacks. Advice for improving security include adding synonyms to the vocabulary list, filtering special characters, employ an SOW (set of words) model, restraining attackers from sending multiple suspicious emails, and updating training sets frequently. 

Future work may include finding a way to attack classifiers with discrete input/output, and constructing a more accurate synonym corpus. 

Question / Proposal

Online machine learning models have been widely used in spam filtering [1,2,3], intrusion detection [4], computer virus detection [5] etc. After obtaining the samples, the designer can choose the appropriate feature space and algorithm, and assess the performance of the model on the testing set. However, undefended classifiers can be vulnerable to attacks. Adversarial learning assumes that the attacker can acquire some or all information of the classifier and discusses how he/she can mislead the classifier using knowledge known. By doing so, he/she can invade a classifier and exploit its weaknesses. As a result, an arm race broke out between the attacker and the defender, with both parties employing increasingly complex approaches. The traditional way of model training (minimizing the target function) are incompetent in cyber security.

Can a machine learning classifier (such as an email spam filter) be manipulated by attacks? How can we invade it? How to improve it so that it can be resistant to attacks? In this project we expect that an undefended classifier can be successfully invaded through a specific approach. Because email classification is a Natural Languange Processing (NLP) problem in its nature, we expect to meet discrete optimization problems that require us to devise a way to resolve it. We also plan to come up with some advice on the security of spam filters. 


Literature Review

Previous research has shown the susceptibility of machine learning algorithms, especially in the field of computer vision. Papernot et al.[8] attacked the deep neural network on the MNIST handwritten numbers database during test time. Goodfellow et al.[9] demonstrated how to change the output category of an input image by adding unnoticeable distraction layers.

Optimal attack strategy during test time

Suppose classification strategy  maps samples  to machine classification output, with -1 denoting benign samples and +1 denoting malicious samples. The classifier is trained on training set , which is extracted from the probability distribution . The classifier determines the output  by thresholding a continuous function . We assume that  when  and  when .

The goal of the attacker

The goal of the attacker is to modify primary malicious sample , so that the modified sample  can both satisfy his/her needs and bypass the classifier. In this process, the attacker usually don't want the distance between  and  to be too far, otherwise  will diverge from his/her original purpose.

The knowledge about the model of the adversary can be categorized into 3 scenarios [15]:

a. The adversary knows the algorithm, parameters and training sets.

b. The adversary knows the algorithm, but doesn't know the parameters and training sets. However, he/she can have  interactions with the classifier.

c. The adversary doesn't know the algorithm. However, he/she can have  interactions with the classifier. 

In scenario b, through limited times of interactions, the adversary can obtain from the probability distribution  a surrogate training set , thus he/she can train a surrogate classifier with the continuous output function , which can be seen as the approximate of the original classifier. Then, the attack can be transformed into an optimization problem

In scenario a, the adversary has already known the function , so the generation of a surrogate classifier can be skipped.

In scenario c, the adversary can only guess the algorithm of the classifier. He/she can refer to literature in that particular field.

Solution to the optimization problem

The optimization problem above can be solved in various ways. It is common to use gradient descent to find the minimum of a function in machine learnimg, but it cannot guarantee the final result is the global minimum. When  is non-convex, the result may be a regional minimum where the output of the classifier is undefined (i.e. the probability ), so the outcome of the attack depends on the symbol of  in this region. The adversary cannot be sure that the attack can succeed. Therefore, the adversary should consider placing  in a region where there are more benign samples. To achieve this effect, we add a penalty term that estimates  to prevent iteration from going into an area of low density.

where  is a weight coefficient and the penalty term

gives the iteration a tendency to mimic benign samples. Iteration will not go where the gradient descends the fastest, but the chance of success will be higher. The discussion above can be summarized in the pseudocode below.

Method / Testing and Redesign

Constructing an email spam filter

According to the Bayesian hyppothesis, texts can be seen as sets of different words. There are 2 methods for estimating probability in Naive Bayes, namely multivariate Bernoulli event model and multinomial event model. The former leads to an SOW (set of words) approach, while the latter leads to a BOW (bag of words) approach. We take the former approach.

Let  be the state of appearance of feature  in text , with  meaning the feature appears in the text and  meaning the feature doesn't appear. Then we have

where  denotes the probability of  when the label of the text is . By smoothening the text frequency, we have

Then we employ risk minimization. Given a priori probability  and conditional probability density , the space of decision is composed of  decisions marked by .   denote the cost of taking action  when the actual label is . We draw the decision chart

According to the Bayesian theorum, a posteriori probability can be found by


We can see in the decision chart that for decision , the conditional expectation cost is

so the question evolves into finding the dicision that produces the minimal cost

Because the detection of spam is a binary classification problem, we can construct the continuous classification function

where the feature vector is

Realization of the email spam filter

We deem it worse to have a legal email classified as spam than have a spam email classified as legal, for the former may cause greater trouble to the user. Thus, we draw the decision chart

The training set we used is from Aprilvkuo [11], which is in Chinese. However the filter can deal with emails in other languages as long as training sets in those languages are available.

A special problem with the Chinese language is word segmentation. In the process of realization, we first employ the open-source python Chinese word segmentation library jieba [16] to segment the corpus. Then, we generate the vocabulary list, simultaneously calculating  and . When processing testing sets, we turn it into word vectors, then apply the Bayesian theorem and obtain the output function through the decision chart. 

Designing the invasion method

For every word in the emails, we define the coefficient

We can tell that when ,word  sways the classifier to output “malicious”, and the greater  is, the greater the tendency for a “malicious ” output is. The opposite is also true. We can acquire the words with the greatest  by sorting the vocabulary list.

To increase the tendency of the spam to be classified as “ham” without sacrificing its original meaning, we employ a synonym find-and-replace process, which we call step 1. Suppose  satisfies , with a synonym set . We seek to find  with the smallest . If  (where  is a parameter), we replace  with .

If  is still positive, we take step 2. For all  with , we impair the structure of  by adding spaces in it. The process can be summarized in the pseudocode below.


The assessment of the spam filter

The email corpus consists of 16788 legitimate emails (hams) and 28574 spams, from which we extract 3000 hams and 3000 spams as the training set and 200 hams and 200 spams as the testing set. Training sets, testing sets and source code can be seen in my github page.

We now define the following coefficients:

a.  demonstrates the ability of the spam filter to detect spams. Higher the recall, lower the number of misclassified spams.

b.  demonstrates the ability of the spam filter to make correct decisions.

Through calculation we obtain  and . We can conclude that our spam filter can perform excellently in traditional scenarios.

The results of invading the spam filter

We extract a surrogate training set from the same probability distribution with 1000 legitimate emails and 1000 spams, which represents the samples available to the attacker. Through deduction, we have

where 205085 is the total word count of legitimate emails and 227563 is the total word count of spam emails. M is the total words in the vocabulary list.

is the input vector. We find the partial derivatives

Through experiments, we find that because the elements in the feature vector are binary, the continuous optimization approach discussed in the “Research” section is not capable of solving this problem. When  is descending along the reverse direction of the gradient, the elements of the vector can become decimals, rendering the word vector meaningless. Hence, it is necessary to come up with a scenario-specific approach. The process of redesigning and alternative methods can be found in “Designing the invasion method” of the “Method / Testing and Redesign” section.

In the realization process, we first preprocess the Harbin Institute of Technology Extended Synonym corpus [17], then truncate the corpus according to the words in the vocabulary (to shorten the time needed to perform algorithm 2). We randomly picked from the dataset a spam email (as seen in AttackPoint0.txt) as . When calculating

We transform the multiplication process to addition process by performing the “log” operation on both sides. When calculating  and , we perform the transformation

Hence the actual  is different from that of theoretical discussions.

Output of  is 1.6809147978394804e+55. After step 1,  is 829953.0933925719. Repeat step 2 until convergence, we have . The email is still classified by the filter as a spam. As the results have shown, the model proposed in this project is robust and is hard to be invaded. The attacker can only replace the words with high  as shown in Results\FeaturesSorted.txt to increase the chance of success.

If the attacker knows all information about the model (scenario a), he can use  to perform a direct attack. In this case the output of  is 8.037851463178479e+62. After step 1,  is 3916.4914696729998. Repeat step 2 until convergence, we have . The email is eventually classified by the filter as a ham. It can be inferred that the failure of the previous attack may be due to a surrogate training set that's too small.


Security advice for email spam classifiers

1. The adversary may replace sensitive words with their synonyms to bypass filtration. Synonyms should be added to the vocabulary list with a similar  in the training process.

2. Adding spaces or special characters within complete words can result in a different output. Noise should be removed in the preprocessing of the corpora.

3. Adversaries may manipulate word frequencies to mislead the classifier. A SOW (set of words) approach is recommended.

4. Multiple spams can be sent from the adversary’s email address when he/she is probing the reaction of the filter. When this kind of suspicious behavior is detected, human intervention is needed. If the adversary is discovered to be sniffing the probability distribution of the model, his/her email address should be frozen.

5. As the language conventions change, the designer should constantly update their training sets.

Research conclusion

In this project we first proposed an algorithm for invading a machine learning classifier against the backdrop of adversarial learning. The algorithm was improved by adding a penalty term facilitating the discovery of a global minimum.

Next, we constructed, realized, and assessed an email spam filter based on Naïve Bayes with minimal risk. Experiments showed great recall and accuracy for the classifier.

Then we thought about how to invade a spam classifier. Based on the ideas before and the fact that the input of the model is discrete, we constructed a new method to attack the filter and randomly picked a spam to test it. The attack process misleads the classifier by replacing and separating words. Experiments showed that our model is relatively robust when facing deliberate intrusions.

Finally, we came up with security advice for a spam filter based on the principles of the model and the way we had invaded it. Through demonstrating an attack during test time, we hope to emphasize that apart from traditional assessments, the model also needs to adapt to deliberate attacks. As the skills of the adversary develop, the defender needs to be insightful and implement necessary defensive strategies to combat attacks.

Possible future work

1. Algorithm 1 is for continuous models. The intrusion techniques for discrete models are yet to be explored.

2. The scope of the synonym corpus is too broad. For the paraphrased emails to make sense, the corpus should be truncated.

About me

Hi! I'm Yuanhao "Michael" Luo from Shanghai Experimental School, Shanghai, China. I am a big fan of Mediterranean history, a lover of philosophy, an appreciator of Renaissance arts, and most of all, an amazed traveller in the fascinating world of maths and computer science. I enjoy the process of generating great ideas and putting them into practice. 

The computer has deeply appealed to me since I was in primary school. I often wondered how the games are coded so that they could be so vivid and entrancing. With the dream of designing my own game, I embarked on the unbelievable journey of computer sciences. I taught myself languages such as C++ and python, and started using them to solve problems in daily life. I developed a software for my teachers so that they can analyze and distribute the performances of students with more convenience. Cooperating with our dean of studies, I managed to allot seats for students during important exams.

I am also the leader of a mathematical modeling team. We dealt with problems from finding the best hospital for a patient to projecting the future population of China. Our team has already won several great prizes, including the Meritorious of the international round of the 4th International Mathematical Modeling Challenge. 

I hope to use what I've learnt and created to benefit the world. Winning the GSF can help me to realize that goal and give me greater motivation to move forward!

Health & Safety

The only health & safety point I'd like to state here is that don't stay up too late and stare at the computer screen for too long (which I myself didn't manage to observe).

Bibliography, references, and acknowledgements


I'd like to thank my parents and teachers for supporting me in my pursuit of knowledge and innovation. 


[1]      Hao, Shuang, et al. "Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine." USENIX security symposium. Vol. 9. 2009.

[2]      Ramachandran, Anirudh, Nick Feamster, and Santosh Vempala. "Filtering spam with behavioral blacklisting." Proceedings of the 14th ACM conference on Computer and communications security. ACM, 2007.

[3]      李雯, and 刘培玉. "基于贝叶斯的垃圾邮件过滤算法的研究." 计算机工程与应用 43.23 (2007): 174-176.

[4]      Liao, Hung-Jen, et al. "Intrusion detection system: A comprehensive review." Journal of Network and Computer Applications 36.1 (2013): 16-24.

[5]      Stolfo, Salvatore J., et al. "Detecting viral propagations using email behavior profiles." ACM transactions on internet technology (TOIT) (2004): 128-132.

[6]      Dekel, Ofer, Ohad Shamir, and Lin Xiao. "Learning to classify with missing and corrupted features." Machine learning 81.2 (2010): 149-178.

[7]      Brückner, Michael, Christian Kanzow, and Tobias Scheffer. "Static prediction games for adversarial learning problems." Journal of Machine Learning Research 13.Sep (2012): 2617-2654.

[8]      Papernot, Nicolas, et al. "The limitations of deep learning in adversarial settings." Security and Privacy (EuroS&P), 2016 IEEE European Symposium on. IEEE, 2016.

[9]      Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples (2014)." arXiv preprint arXiv:1412.6572.

[10]  Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.


[12]  张华平, 刘群. 基于N-最短路径方法的中文词语粗分模型[J]. 中文信息学报, 2002, 16(5):3-9.

[13]  黄昌宁, 高剑峰, 李沐. 对自动分词的反思[C]// 全国计算语言学联合学术会议. 2003.

[14]  俞鸿魁, 张华平, 刘群,等. 基于层叠隐马尔可夫模型的中文命名实体识别[J]. 通信学报, 2006, 27(2):87-94.

[15]  Biggio, Battista, et al. "Evasion attacks against machine learning at test time." Joint European conference on machine learning and knowledge discovery in databases. Springer, Berlin, Heidelberg, 2013.


[17]  语言技术平台[J], 中文信息学报, 2011, 25(6):53-63