# K-Nearest Neighbors vs Naive Bayes Algorithms for Spam Detection in Emails

## Introduction

A classifier is an algorithm that maps input data to a specific category: it is trained to map data points to their respective class labels. Classifiers are used in a variety of applications, one of them being detecting spam emails. In this blog post, we will compare the performance of two classifiers, K-Nearest Neighbors and Naive Bayes, on spam email classification, and explore how each algorithm performs its training and prediction.

## K-Nearest Neighbors (KNN)

KNN is a memory-based supervised learning algorithm that learns by memorizing the data itself.

**Training**

- Stores the training data and the class labels so that the classifier memorizes the training data. We call the memorized training data **exemplars**.

**Prediction**

Use the trained KNN classifier to predict the class label of each test sample.

*Procedure*

- Calculate the distance from each test sample to all the training exemplars.
- Find the k closest exemplars.
- Let the k closest exemplars vote on the class label.
- The most frequent class label among the k closest exemplars becomes the predicted class label for the test sample.
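The steps above can be sketched in a few lines of Python. This is a minimal illustration using Euclidean distance on toy feature vectors (e.g., word counts); the function and data names are my own, not from any particular library.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, test_x, k=3):
    """Predict the class of test_x by majority vote among the k nearest exemplars."""
    # Distance from the test sample to every training exemplar (Euclidean here).
    distances = [(math.dist(x, test_x), label) for x, label in zip(train_X, train_y)]
    # Find the k closest exemplars.
    k_nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Let the k closest exemplars vote; the most frequent label wins.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Toy "exemplars": feature vectors with spam / not-spam labels.
train_X = [[5, 1], [4, 2], [1, 5], [0, 6]]
train_y = ["spam", "spam", "not spam", "not spam"]
print(knn_predict(train_X, train_y, [4, 1], k=3))  # spam
```

Note that "training" here is just storing `train_X` and `train_y`; all the work happens at prediction time.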

## Naive Bayes Classifier

The Naive Bayes Classifier depends on probability distributions and applies Bayes’ Theorem in its implementation. The classifier is considered “naive” because it assumes that the features are conditionally independent given the class. In essence, the classifier works by finding the probability that a data point belongs to each of the categories in a data set; the data point is then assigned to the category with the highest probability. In the case of email spam classification, we consider two categories: spam and not spam. Naive Bayes uses Bayes’ theorem to calculate the probability of a new email belonging to either the spam or the not-spam class, and the class with the higher probability becomes the predicted class of the test sample.
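In symbols, Bayes’ theorem gives the posterior probability of the spam class for a given email:

$$
P(\text{spam} \mid \text{email}) = \frac{P(\text{email} \mid \text{spam})\, P(\text{spam})}{P(\text{email})}
$$

The denominator $P(\text{email})$ is the same for both classes, so in practice we only need to compare the numerators for spam and not-spam.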

**Training**

The Naive Bayes classifier is trained so that it records the “statistics” of the training set. These “statistics” are class priors and class likelihoods.

Class Priors: The probability that a data sample belongs to a given class before any of its features are observed. In our example, the class priors would be:

The probability that an email is spam = (number of spam emails) / (total number of emails in the training data set)

The probability that an email is not spam = (number of not spam emails) / (total number of emails in the training data set)

Class Likelihoods: The probability of observing a feature given that the sample belongs to a particular class. In our example, a likelihood would be the probability of a word appearing in an email from the spam class or from the not-spam class.
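Recording these "statistics" can be sketched as follows. This is a minimal, illustrative implementation under my own assumptions: words are split on whitespace, and I add Laplace (add-one) smoothing to the likelihoods so that words unseen in a class do not get zero probability.

```python
from collections import Counter

def train_naive_bayes(emails, labels):
    """Record the training-set 'statistics': class priors and word likelihoods."""
    vocab = {w for e in emails for w in e.split()}
    priors, likelihoods = {}, {}
    for c in set(labels):
        docs = [e for e, l in zip(emails, labels) if l == c]
        # Class prior: fraction of training emails that belong to class c.
        priors[c] = len(docs) / len(emails)
        # Class likelihood: probability of each word given class c,
        # with Laplace add-one smoothing (my addition, not in the text).
        counts = Counter(w for doc in docs for w in doc.split())
        total = sum(counts.values())
        likelihoods[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, likelihoods

# Toy training set.
emails = ["free money", "meeting tomorrow", "free offer"]
labels = ["spam", "not spam", "spam"]
priors, likelihoods = train_naive_bayes(emails, labels)
print(round(priors["spam"], 2))  # 0.67
```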

**Prediction**

To make predictions, we combine the prior and the likelihoods to obtain the posterior probability of each class.

The predicted class for a test data sample is the class that yields the highest posterior probability. In our example, an email in the test data set will be categorized as either spam or not spam depending on which of the two classes yields a higher posterior probability.
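A minimal sketch of this prediction step, assuming we already have the priors and per-word likelihoods from training (the hard-coded numbers below are hypothetical). Summing log probabilities instead of multiplying raw probabilities is a standard trick to avoid numerical underflow.

```python
import math

def nb_predict(email, priors, likelihoods):
    """Return the class with the highest posterior (up to the shared denominator)."""
    scores = {}
    for c, prior in priors.items():
        # log P(c) + sum of log P(word | c) is the log of prior * likelihoods.
        score = math.log(prior)
        for w in email.split():
            if w in likelihoods[c]:  # ignore words unseen in training
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical "statistics" recorded during training.
priors = {"spam": 0.5, "not spam": 0.5}
likelihoods = {
    "spam": {"free": 0.4, "money": 0.4, "meeting": 0.2},
    "not spam": {"free": 0.1, "money": 0.2, "meeting": 0.7},
}
print(nb_predict("free money", priors, likelihoods))  # spam
```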

(If you want to better understand how Naive Bayes works, check out this blog by Tony Yiu.)

## Application

I implemented these algorithms in a project I did involving the Enron email dataset. Check out the project here for more information. The overall goal of the project was to implement an email spam filter that can determine, with high accuracy, whether an email is spam or not.

## Conclusion

K-Nearest Neighbors can be computationally expensive because calculating the distance between a test sample and all the training exemplars can take a very long time, especially with large data sets. It also takes up a lot of memory, since the algorithm needs to store all the training data. Choosing a good value of k can also be a challenge with KNN. The Naive Bayes classifier, by contrast, takes less time to compute, and there is no hyperparameter to tune analogous to choosing k in KNN.