k-Nearest Neighbors Classification Algorithm

k-Nearest Neighbors (kNN) is a supervised machine-learning algorithm for classification. It works like this: we start with a training set of example data, and every example carries a label telling us which class it belongs to. When we are given a new, unlabeled piece of data, we compare it against every example in the training set and keep the k most similar examples (the nearest neighbors); this is where the k comes from (k is an integer, usually less than 20). Finally, we take a majority vote among the labels of those k neighbors, and the winning label is the class we assign to the new piece of data.

k-Nearest Neighbors

  • Pros: High accuracy, insensitive to outliers, no assumptions about data
  • Cons: Computationally expensive, requires a lot of memory
  • Works with: Numeric values, nominal values

Pseudocode for kNN

  • for every point in our dataset, calculate the distance between inputX and the current point
  • sort the distances in increasing order
  • take the k items with the lowest distances to inputX
  • find the majority class among these items
  • return that majority class as our prediction for the class of inputX
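
This pseudocode maps almost line for line onto plain Python. Here is a minimal sketch using only the standard library (math.dist and collections.Counter); knn_predict is a hypothetical name, and the NumPy version used in the rest of this post follows below:

import math
from collections import Counter

def knn_predict(inputX, dataSet, labels, k):
    # 1) distance from inputX to every point in the dataset
    distances = [math.dist(inputX, point) for point in dataSet]
    # 2) indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # 3) majority vote among the labels of the k nearest neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]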

Suppose we use the Euclidean distance to calculate the distance between two vectors x and y:
$$d = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2}$$
If there are 4 features in the dataset, the distance between points (1, 2, 3, 4) and (4, 3, 2, 1)
would be calculated by
$$d = \sqrt{(1-4)^2+(2-3)^2+(3-2)^2+(4-1)^2} = \sqrt{20} \approx 4.47$$
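
As a quick sanity check, the same distance can be computed with NumPy (the library the implementation below relies on):

import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([4, 3, 2, 1])
d = np.sqrt(((x - y) ** 2).sum())  # Euclidean distance
print(d)                           # 4.4721..., i.e. sqrt(20)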

Implementation in Python

kNN.py

import numpy as np
import operator

def createData():
    dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return dataSet, labels

def classifier(inputX, dataSet, labels, k):
    # 1) calculate the distance from inputX to every point in dataSet
    dataSetSize = dataSet.shape[0]                         # number of rows in dataSet
    diffMat = np.tile(inputX, (dataSetSize, 1)) - dataSet  # repeat inputX and subtract row by row
    squareDiffMat = diffMat ** 2                           # square each element
    squareDistances = squareDiffMat.sum(axis=1)            # sum the squares across each row
    distances = squareDistances ** 0.5                     # square root gives the Euclidean distance
    sortedDistance = distances.argsort()                   # indices that would sort the distances
    # 2) select the k lowest distances
    classCount = {}                                        # key is the label, value is its vote count
    for i in range(k):
        voteLabel = labels[sortedDistance[i]]              # label of the i-th nearest neighbor
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1  # accumulate that label's votes
    # 3) sort the final result
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # highest count first
    return sortedClassCount[0][0]                          # return the winning label
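
A side note on the tile call: modern NumPy makes it unnecessary, because broadcasting subtracts inputX from every row of dataSet automatically, so diffMat = inputX - dataSet would behave the same here. The tile version simply makes the row-by-row expansion explicit.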

To try it out, start a Python session in the directory that contains kNN.py and type the following commands:

import kNN
group, labels = kNN.createData()  # renamed from input/label to avoid shadowing the built-in input
kNN.classifier([0, 0], group, labels, 3)

The call returns 'B': the three nearest neighbors of [0, 0] are [0, 0], [0, 0.1], and [1.0, 1.0], and two of the three are labeled B.
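
For comparison, if you have scikit-learn installed, its built-in KNeighborsClassifier (which uses Euclidean distance by default) should reproduce the same prediction:

from sklearn.neighbors import KNeighborsClassifier

import kNN

group, labels = kNN.createData()
model = KNeighborsClassifier(n_neighbors=3)  # k = 3
model.fit(group, labels)
print(model.predict([[0, 0]]))               # ['B'], matching our own classifier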