Last Updated on October 30, 2020
K-nearest neighbor (KNN) is a simple and efficient method for classification problems. It is a statistical learning technique that has been studied in pattern recognition, data science, and machine learning.[1], [2] The idea is to assign an unseen point to the dominant class among its k nearest neighbors within the training set.[3]
The KNN process can be summarized as follows:
- Given a test point to be classified, the algorithm searches for its k nearest neighbors among the pre-classified training data using some similarity measure, and ranks those k neighbors by their similarity scores.
- The categories of the k nearest neighbors are then used to predict the category of the test point, with each neighbor's ranked score acting as the weight of its candidate category.
- If more than one neighbor belongs to the same category, the sum of their scores is used as the weight of that category. The category with the highest weight is assigned to the test point, provided it exceeds a predefined threshold.[1]
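As a rough illustration of this score-weighted vote (the names weighted_knn_vote, neighbors, and threshold below are illustrative only; the implementation later in this article uses a simpler unweighted majority vote):
from collections import defaultdict

def weighted_knn_vote(neighbors, threshold=0.0):
    '''neighbors: list of (similarity_score, category) pairs for the k nearest points.'''
    weights = defaultdict(float)
    # Sum the similarity scores of neighbors that share a category
    for score, category in neighbors:
        weights[category] += score
    # Pick the category with the highest accumulated weight
    best_category, best_weight = max(weights.items(), key=lambda item: item[1])
    # Assign it only if the weight exceeds the predefined threshold
    return best_category if best_weight > threshold else None
For example, weighted_knn_vote([(0.9, 'Iris-setosa'), (0.8, 'Iris-versicolor'), (0.7, 'Iris-setosa')]) returns 'Iris-setosa', because its summed score (1.6) is the largest.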
DATASET
The purpose of this article is to implement the KNN classification algorithm for the Iris dataset. You can download the data from The UCI Machine Learning Repository.
The training data uses 50% of the Iris dataset (75 rows), and the testing data uses the other 50% (75 rows). The dataset has four measurements that are used as KNN features: sepal length, sepal width, petal length, and petal width. The species (class) attribute is used as the prediction target, and each row is classed as Iris-setosa, Iris-versicolor, or Iris-virginica.
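One possible way to produce these two files from the full iris.data file downloaded from UCI is sketched below; the file name iris.data, the fixed shuffle seed, and the header-less output are assumptions, not part of the code in Main.py:
import pandas as Pandas

# Illustrative sketch: load the full 150-row dataset downloaded from the UCI repository
columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
iris = Pandas.read_csv('iris.data', header=None, names=columns)
# Shuffle, then write the first 75 rows as training data and the remaining 75 as testing data
shuffled = iris.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled.iloc[:75].to_csv('IrisTrainingData.csv', index=False, header=False)
shuffled.iloc[75:].to_csv('IrisTestingData.csv', index=False, header=False)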
This article also explains how to execute the program, presents the Python code, and describes the technologies used to solve the problem. The report ends with a brief discussion of the classification results obtained with KNN.
DEVELOPMENT
A KNN model learns in a supervised way, since it needs a set of training data for which the expected output is known. To generate a prediction, the algorithm first calculates the Euclidean distance between the test point and its neighbors and then selects the k nearest ones, that is:
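For a test point x = (x₁, …, xₙ) and a training point y = (y₁, …, yₙ) with n features, the Euclidean distance is:
d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)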
Continuing with the process, the algorithm finds the k training elements that are closest to the test point and finally determines the class from them.
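A minimal sketch of this step, assuming the training features are already in a Numpy array X_train with labels y_train (these names are illustrative; the full implementation used in this article follows later):
import numpy as Numpy

def k_nearest_labels(X_train, y_train, test_point, k):
    # Euclidean distance from the test point to every training point
    distances = Numpy.sqrt(((X_train - test_point) ** 2).sum(axis=1))
    # Positions of the k smallest distances
    nearest = Numpy.argsort(distances)[:k]
    # Labels of the k nearest neighbors; the final class is the most frequent label
    return [y_train[i] for i in nearest]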
Technologies Used
To solve this problem, we use Python 3 with the following libraries:
- Pandas: Python library for data structures and statistical tools.[2]
- Numpy: Python library providing the base array data structure; the data and model parameters are handled as Numpy arrays.[1]
- Matplotlib: Python library for creating 2D graphics.[4]
IMPLEMENTATION
One Python file is used, named Main.py. Two .csv files hold the training and testing data: IrisTrainingData.csv and IrisTestingData.csv. The number of neighbors k ranges from 1 to 75, matching the number of training rows.
Import libraries:
import pandas as Pandas
import numpy as Numpy
import matplotlib.pyplot as Pyplot
import time
Record the start time so the computation time can be measured:
start_time = time.time()
Loading Dataset:
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth','PetalLength', 'PetalWidth', 'Species']
TrainingPath, TestingPath = 'IrisTrainingData.csv', 'IrisTestingData.csv'
SPECIES = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
Make a KNN Class:
class KNN():
Build the functions inside the KNN class:
Initialization Function
Parameter Description:
k (int): the number of nearest neighbors
    def __init__(self, k):
        self.k = k
Function for Loading Training Data
Parameter Description:
TrainingPath(string): File path of the training dataset
ColoumnName(string): Column name of the given dataset
    def TrainingData(self, TrainingPath, ColoumnName='Species'):
        '''
        Load training data
        '''
        # Read the CSV file (no header row) and shuffle the rows
        TrainingCSV = Pandas.read_csv(
            TrainingPath, header=None,
            names=CSV_COLUMN_NAMES).sample(frac=1).reset_index(drop=True)
        # Split the training dataset into features and labels
        TrainingFS, self.TrainingLS = TrainingCSV, TrainingCSV.pop(ColoumnName)
        # Min-max normalize the features to the range [0, 1]
        self.norm_TrainingFS = (TrainingFS - TrainingFS.min()) / \
            (TrainingFS.max() - TrainingFS.min())
        return self.norm_TrainingFS, self.TrainingLS
Function for Getting Testing Data
Parameter Description:
TestingPath(string): File path of the testing dataset
ColoumnName(string): Column name of the given dataset
    def TestingData(self, TestingPath, ColoumnName='Species'):
        '''
        Load testing data
        '''
        # Read the CSV file (no header row) and shuffle the rows
        TestingCSV = Pandas.read_csv(
            TestingPath, header=None,
            names=CSV_COLUMN_NAMES).sample(frac=1).reset_index(drop=True)
        # Split the testing dataset into features and labels
        TestingFS, self.TestingLS = TestingCSV, TestingCSV.pop(ColoumnName)
        # Min-max normalize the features to the range [0, 1]
        self.norm_TestingFS = (TestingFS - TestingFS.min()) / \
            (TestingFS.max() - TestingFS.min())
        return self.norm_TestingFS, self.TestingLS
Function for Predicting the Label of Each Test Point
Parameter Description:
TestPoint (numpy.ndarray): feature vector of a single test point
    def Prediction(self, TestPoint):
        '''
        Predict the label of a single test point
        '''
        Distance = []
        # Calculate the distance of the given point `TestPoint` from every
        # point in the training set `norm_TrainingFS`, using the sum of
        # absolute feature differences (Manhattan distance)
        for f in self.norm_TrainingFS.values:
            Distance.append(sum(map(abs, f - TestPoint)))
        # Bind the distances to the training labels
        _ = Pandas.DataFrame({"F": Distance, "L": self.TrainingLS})
        # Sort the dataframe by distance from low to high and
        # return the first k training labels
        _ = _.sort_values(by='F')['L'][0:self.k].values
        return _
Main Program
Initialization
# Initialization
TrainingAccuracy = []
TestingAccuracy = []
# K: from 1 to len(TrainingFS)
for k in range(75):
    knn = KNN(k=k + 1)
    # Load data
    TrainingFS, TrainingLS = knn.TrainingData(TrainingPath)
    TestingFS, TestingLS = knn.TestingData(TestingPath)
Training Process
    # Training Process
    correct = 0  # Number of correct Predictions on the training data
    for i, TestPoint in enumerate(TrainingFS.values, 0):
        _ = knn.Prediction(TestPoint)
        count = [list(_).count('Iris-setosa'), list(_).count('Iris-versicolor'),
                 list(_).count('Iris-virginica')]
        print('Distribution: {}'.format(count))
        mode = SPECIES[count.index(max(count))]
        if mode == TrainingLS[i]:
            correct += 1
        print('Prediction: {}'.format(mode),
              'TEST_LABEL: {}'.format(TrainingLS[i]))
    TrainingAccuracy.append(correct / len(TrainingFS))
Testing Process
    # Testing Process
    correct = 0  # Number of correct Predictions on the testing data
    for i, TestPoint in enumerate(TestingFS.values, 0):
        _ = knn.Prediction(TestPoint)
        count = [list(_).count('Iris-setosa'), list(_).count('Iris-versicolor'),
                 list(_).count('Iris-virginica')]
        print('Distribution: {}'.format(count))
        mode = SPECIES[count.index(max(count))]
        if mode == TestingLS[i]:
            correct += 1
        print('Prediction: {}'.format(mode),
              'TEST_LABEL: {}'.format(TestingLS[i]))
    TestingAccuracy.append(correct / len(TestingFS))
Graphic of Training & Testing Accuracy with k = 1 to 75
# Graphic of Training & Testing Accuracy with k = 1 to 75
for (i, EachResult) in enumerate(TrainingAccuracy, 0):
    print('k: {}'.format(i + 1), 'Accuracy: {}'.format(EachResult))
Pyplot.figure()
Pyplot.plot(Numpy.arange(1, 76, 1), TrainingAccuracy, color='orange')
Pyplot.plot(Numpy.arange(1, 76, 1), TestingAccuracy, color='g')
Pyplot.legend(('Training Accuracy', 'Testing Accuracy'), loc=3)
Pyplot.title('k - Accuracy')
Pyplot.xlabel('Number of k')
Pyplot.ylabel('Accuracy')
Pyplot.show()
# Graphic of Testing Accuracy with k = 1 to 75
for (i, EachResult) in enumerate(TestingAccuracy, 0):
    print('k: {}'.format(i + 1), 'Accuracy: {}'.format(EachResult))
Pyplot.figure()
Pyplot.plot(Numpy.arange(1, 76, 1), TestingAccuracy, color='g')
Pyplot.title('k - Accuracy')
Pyplot.xlabel('Number of k')
Pyplot.ylabel('Accuracy')
Pyplot.show()
print("--- %s seconds ---" % (time.time() - start_time))
RESULT AND DISCUSSION
Explanation of Training and Testing Result
The accuracy of the training and testing data is shown in Figure 1. The training accuracy appears as the orange line. The orange line shows that when k=1 the accuracy is 100%, which indicates overfitting. Conversely, using k >= 50 leads to underfitting, because the accuracy drops below 70%. Values of k between 2 and 49 give the highest accuracy, between 89% and 97%.
The accuracy of the testing data is shown by the green line. It shows that the highest accuracy, 97%, is obtained at k=1 and k=13. The most balanced accuracy between training and testing for a small k is found at k=16, where both are around 96%.
Result Analysis
I think it is difficult to determine which value of k is the most appropriate, since every time the program is executed the order of the training data changes, so the answer varies. For this Iris classification case, one can use k=13, which gives the highest testing accuracy, or k=16, which balances training and testing accuracy. Overall, KNN is a simple algorithm to program and can be very useful for classification.
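One way to make the runs repeatable, as a suggestion rather than part of the original code, is to pass a fixed random_state to the shuffle inside TrainingData and TestingData, for example:
# Fixed seed (illustrative) so the shuffled order, and therefore the accuracy, is the same on every run
TrainingCSV = Pandas.read_csv(
    TrainingPath, header=None,
    names=CSV_COLUMN_NAMES).sample(frac=1, random_state=42).reset_index(drop=True)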
Read Also: Implementation of Gradient Descent for Energy Efficiency Prediction
REFERENCES
- [1] R. Al-Shalabi, G. Kanaan, and M. H. Gharaibeh, “Arabic text categorization using KNN algorithm,” presented at the Proc. of Int. Multi Conf. on Computer Science and Information Technology (CSIT 2006), 2006.
- [2] W. McKinney, “Data Structures for Statistical Computing in Python,” in Proceedings of the 9th Python in Science Conference, 2010, doi: 10.25080/majora-92bf1922-00a.
- [3] F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for EEG-based brain–computer interfaces,” J. Neural Eng., pp. R1–R13, Jan. 2007, doi: 10.1088/1741-2560/4/2/r01.
- [4] J. D. Hunter, “Matplotlib: A 2D Graphics Environment,” Comput. Sci. Eng., pp. 90–95, 2007, doi: 10.1109/mcse.2007.55.
Featured Image Source: Freepik