How to Implement K-Nearest Neighbors in Python

Key Insights

  • K-Nearest Neighbors is a lazy learning algorithm that makes predictions by finding the k most similar training examples and using their labels to vote on the outcome—no training phase required.
  • Distance metrics and feature scaling are critical: Euclidean distance is the default, but features must be normalized or features with larger scales will dominate the distance calculations.
  • The optimal k value balances bias and variance—low k values create complex decision boundaries prone to overfitting, while high k values oversimplify and underfit; use cross-validation to find the sweet spot for your dataset.

Introduction to K-Nearest Neighbors

K-Nearest Neighbors (KNN) is one of the simplest yet most effective machine learning algorithms. Unlike most algorithms that build a model during training, KNN is a lazy learner—it stores the training data and does all the work at prediction time. When you ask it to classify a new point, it finds the k closest training examples and lets them vote on the answer.

For classification, KNN uses majority voting. If you’re predicting whether an email is spam with k=5, it finds the 5 most similar emails in your training set and assigns the class that appears most frequently. For regression, it averages the values of the k nearest neighbors instead.
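
The regression variant is a one-line change in scikit-learn. A minimal sketch with made-up data where y roughly doubles x:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data: y is approximately 2 * x
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.1, 11.8])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)

# The prediction for x=3.4 averages the targets of the 3 nearest
# training points (x=3, x=4, x=2): mean of 5.9, 8.2, 4.1 ≈ 6.07
print(reg.predict([[3.4]]))
```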

KNN works best when you have clear clusters in your data, a moderate number of features (typically under 20), and enough training examples to represent your problem space. It struggles with high-dimensional data due to the curse of dimensionality and can be slow with large datasets since it must calculate distances to all training points for each prediction.

The algorithm relies on distance metrics to determine “closeness.” Euclidean distance (straight-line distance) is most common, but Manhattan distance (sum of absolute differences) works better for high-dimensional spaces, and Minkowski distance generalizes both.
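
These metrics are simple to compute directly; a quick NumPy sketch (the two points here are arbitrary, chosen so the Euclidean distance is a clean 3-4-5 triangle):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean: straight-line distance (Minkowski with p=2)
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan: sum of absolute differences (Minkowski with p=1)
manhattan = np.sum(np.abs(a - b))

# Minkowski generalizes both through the exponent p
def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(euclidean)           # 5.0
print(manhattan)           # 7.0
print(minkowski(a, b, 2))  # same as the Euclidean distance
```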

Understanding the Mathematics Behind KNN

The foundation of KNN is distance calculation. Euclidean distance between two points is the most intuitive:

import numpy as np

def euclidean_distance(point1, point2):
    """
    Calculate Euclidean distance between two points.
    
    Args:
        point1: numpy array of features
        point2: numpy array of features
    
    Returns:
        float: Euclidean distance
    """
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Example usage
x1 = np.array([1, 2, 3])
x2 = np.array([4, 5, 6])
distance = euclidean_distance(x1, x2)
print(f"Distance: {distance:.2f}")  # Output: Distance: 5.20

The k parameter controls how many neighbors vote. Small k values (like 1 or 3) create complex decision boundaries that fit training data tightly but may overfit. Large k values (like 20 or 50) smooth out predictions but may miss local patterns. The right k depends on your data’s noise level and complexity.
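
A quick way to see the low-k end of this trade-off: with k=1, every training point is its own nearest neighbor, so training accuracy is perfect even when labels are noisy. A sketch with synthetic data (the label flipping is just to inject noise for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
y[:5] = 1 - y[:5]  # flip a few labels to simulate noise

# k=1 memorizes the training set, noisy labels included
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn1.score(X, y))  # 1.0: each point is its own nearest neighbor

# A large k smooths the boundary and "outvotes" the flipped labels,
# so even training accuracy drops below 1.0
knn25 = KNeighborsClassifier(n_neighbors=25).fit(X, y)
print(knn25.score(X, y))
```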

Implementing KNN from Scratch

Building KNN from scratch reveals how simple the algorithm really is. Here’s a complete implementation:

import numpy as np
from collections import Counter

class KNNClassifier:
    def __init__(self, k=3):
        """
        Initialize KNN classifier.
        
        Args:
            k: number of neighbors to consider
        """
        self.k = k
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """
        Store training data (no actual training happens).
        
        Args:
            X: training features (n_samples, n_features)
            y: training labels (n_samples,)
        """
        self.X_train = X
        self.y_train = y
    
    def predict(self, X):
        """
        Predict labels for test data.
        
        Args:
            X: test features (n_samples, n_features)
        
        Returns:
            predictions: predicted labels (n_samples,)
        """
        predictions = [self._predict_single(x) for x in X]
        return np.array(predictions)
    
    def _predict_single(self, x):
        """Predict label for a single test point."""
        # Calculate distances to all training points
        # (euclidean_distance is the function defined in the previous section)
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        
        # Get indices of k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        
        # Get labels of k nearest neighbors
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        
        # Return most common label
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]

# Test the implementation
X_train = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[2, 2], [7, 7]])

knn = KNNClassifier(k=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(f"Predictions: {predictions}")  # Output: Predictions: [0 1]

Using Scikit-learn’s KNN Implementation

For production code, use scikit-learn’s optimized implementation. It includes performance improvements like KD-trees for faster neighbor searches:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")  # Output: Accuracy: 1.000

Compare different k values to see how accuracy varies. Note that choosing k by test-set accuracy risks overfitting to that particular split; cross-validation, covered in the next section, is the principled approach:

# Test different k values
k_values = range(1, 31)
accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

best_k = k_values[np.argmax(accuracies)]
print(f"Best k: {best_k}, Accuracy: {max(accuracies):.3f}")

Preprocessing and Optimization

Feature scaling is non-negotiable for KNN. Without it, features with larger ranges dominate distance calculations. A feature ranging from 0-1000 will overwhelm one ranging from 0-1:

from sklearn.preprocessing import StandardScaler

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on scaled data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

Use cross-validation to find the optimal k value systematically:

from sklearn.model_selection import cross_val_score

# Find optimal k using cross-validation
k_range = range(1, 31)
cv_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

optimal_k = k_range[np.argmax(cv_scores)]
print(f"Optimal k: {optimal_k}, CV Score: {max(cv_scores):.3f}")

Real-World Application: Building a Recommendation System

KNN excels at recommendation systems. Here’s a movie recommendation engine using collaborative filtering:

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Create sample movie ratings dataset
ratings_data = {
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'movie_id': [1, 2, 3, 1, 2, 4, 2, 3, 4, 1, 3, 4],
    'rating': [5, 4, 3, 4, 5, 2, 3, 5, 4, 5, 4, 3]
}
df = pd.DataFrame(ratings_data)

# Create user-item matrix
user_item_matrix = df.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)

# Train KNN model
knn_model = NearestNeighbors(n_neighbors=3, metric='cosine')
knn_model.fit(user_item_matrix.values)

# Find similar users to user 1
distances, indices = knn_model.kneighbors(user_item_matrix.iloc[0].values.reshape(1, -1))

print(f"Users similar to User 1: {indices[0]}")
print(f"Similarity scores: {1 - distances[0]}")  # Convert distance to similarity

# Recommend movies user hasn't seen
user_ratings = user_item_matrix.iloc[0]
similar_users = indices[0][1:]  # Exclude the user themselves

# Get movies rated highly by similar users
recommendations = []
for movie in user_item_matrix.columns:
    if user_ratings[movie] == 0:  # User hasn't seen this movie
        similar_user_ratings = user_item_matrix.iloc[similar_users][movie]
        avg_rating = similar_user_ratings.mean()
        if avg_rating > 0:
            recommendations.append((movie, avg_rating))

recommendations.sort(key=lambda x: x[1], reverse=True)
print(f"\nRecommended movies: {recommendations}")

Performance Considerations and Best Practices

With brute-force search, KNN costs O(n·d) per prediction, where n is the number of training samples and d is the number of features. This makes it slow on large datasets. KD-trees or Ball trees (scikit-learn selects a structure automatically) reduce each query to roughly O(log n) in low dimensions.
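
scikit-learn exposes this choice through the `algorithm` parameter ('auto' by default). A sketch comparing the options on low-dimensional synthetic data; all three return the same neighbors, they only differ in how the search is organized:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))       # low-dimensional: trees pay off here
y = (X[:, 0] > 0).astype(int)

# 'kd_tree' and 'ball_tree' build a search structure in fit();
# 'brute' scans all n training points on every query
for algo in ['brute', 'kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    knn.fit(X, y)
    print(algo, knn.score(X, y))     # identical accuracy for all three
```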

The curse of dimensionality kills KNN performance above 20-30 features. In high dimensions, all points become roughly equidistant, making the “nearest” neighbors meaningless. Apply dimensionality reduction (PCA, feature selection) before using KNN on high-dimensional data.
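
One way to apply that advice is to chain the reduction and the classifier in a scikit-learn pipeline, so PCA is fit only on the training data. A sketch on the iris data used earlier (2 components is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale, project to 2 components, then run KNN in the reduced space;
# the pipeline applies the same fitted transforms at predict time
pipe = make_pipeline(StandardScaler(), PCA(n_components=2),
                     KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.3f}")
```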

KNN is sensitive to class imbalance. If one class dominates, majority voting tends to favor it, especially for points near class boundaries. Use weighted voting (closer neighbors get more influence) or rebalance your dataset through oversampling/undersampling.
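
In scikit-learn, weighted voting is the `weights='distance'` option, which scales each neighbor's vote by the inverse of its distance. A sketch with a deliberately imbalanced toy dataset (8 points of class 0, 2 of class 1):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [0, 2], [2, 0], [2, 1], [1, 2],
              [5, 5], [5, 6]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform').fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X, y)

# The query sits right between the two class-1 points, but with k=5
# three distant class-0 neighbors join the vote
query = [[5.0, 5.5]]
print(uniform.predict(query))   # [0]: majority class outvotes 3 to 2
print(weighted.predict(query))  # [1]: the two very close neighbors dominate
```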

Choose KNN when you need a simple baseline, have clear clusters in your data, and can afford the prediction-time cost. Avoid it for real-time applications with large datasets, high-dimensional problems, or when you need model interpretability. Consider alternatives like Random Forests for better scalability or logistic regression for interpretable models.
