How to Choose K in KNN in Python

Key Insights

  • The choice of K in KNN directly controls the bias-variance tradeoff: small K values (1-3) create flexible models prone to overfitting, while large K values oversmooth decision boundaries and underfit
  • Cross-validation is the gold standard for K selection—systematically test K values from 1 to sqrt(n) and plot validation accuracy to identify the optimal point before performance plateaus
  • Always use odd K values for binary classification to avoid ties, and when several K values perform equivalently, prefer the smaller one: it is simpler and keeps neighbor selection and vote aggregation overhead down

Introduction to K in KNN

The K-Nearest Neighbors algorithm is deceptively simple: classify a point based on the majority vote of its K nearest neighbors. But this simplicity hides a critical decision—choosing the right value for K can mean the difference between a model that generalizes well and one that fails spectacularly.

When K is too small, your model becomes hypersensitive to noise. With K=1, every training point creates its own island of influence, leading to jagged decision boundaries that memorize training data rather than learning patterns. This is high variance—your model changes dramatically with small changes in training data.

When K is too large, you’re averaging over so many neighbors that genuine local patterns get washed out. With K equal to your entire dataset, every prediction would just be the most common class. This is high bias—your model is too rigid to capture real relationships.

Here’s what this looks like in practice:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons

# Generate non-linear dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Create meshgrid for decision boundary
h = 0.02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Compare different K values
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
k_values = [1, 5, 50]

for ax, k in zip(axes, k_values):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
    ax.set_title(f'K={k}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

K=1 produces overly complex boundaries that follow every quirk of the training data. K=5 captures the underlying moon shape reasonably well. K=50 oversmooths, creating boundaries that miss important patterns.

The Square Root Rule

The most common heuristic is startlingly simple: set K to the square root of your training samples. For 10,000 samples, try K=100. For 400 samples, try K=20.

This rule works because it balances two competing needs. As your dataset grows, you can afford to average over more neighbors without losing local detail. The square root provides sublinear growth—K increases with dataset size, but not proportionally.

The limitations are significant. This rule ignores your data’s dimensionality, class balance, noise level, and the actual complexity of decision boundaries. It’s a starting point, not a solution.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)

# Apply square root rule
n_samples = X_train.shape[0]
k_sqrt = int(np.sqrt(n_samples))
if k_sqrt % 2 == 0:  # Make it odd
    k_sqrt += 1

print(f"Training samples: {n_samples}")
print(f"Suggested K: {k_sqrt}")

# Evaluate
knn_sqrt = KNeighborsClassifier(n_neighbors=k_sqrt)
knn_sqrt.fit(X_train, y_train)
y_pred = knn_sqrt.predict(X_test)
print(f"Accuracy with K={k_sqrt}: {accuracy_score(y_test, y_pred):.3f}")

Use this as your baseline, then validate whether it’s actually optimal for your specific problem.

Cross-Validation Approach

Cross-validation is the empirical approach: test multiple K values systematically and measure which performs best on held-out data. This is more computationally expensive but far more reliable than heuristics.

The strategy is straightforward: try K values over a generous range (from 1 up to around sqrt(n), or simply 1 to 50), use k-fold cross-validation to estimate each K’s performance, then plot the results to identify the optimal value.

from sklearn.model_selection import cross_val_score

# Test range of K values
k_range = range(1, 51)
k_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, 
                            scoring='accuracy')
    k_scores.append(scores.mean())

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(k_range, k_scores, marker='o')
plt.xlabel('K Value')
plt.ylabel('Cross-Validated Accuracy')
plt.title('KNN Performance vs K')
plt.grid(True, alpha=0.3)

# Mark the optimal K
optimal_k = k_range[np.argmax(k_scores)]
plt.axvline(x=optimal_k, color='r', linestyle='--', 
            label=f'Optimal K={optimal_k}')
plt.legend()
plt.show()

print(f"Optimal K: {optimal_k}")
print(f"Best CV Accuracy: {max(k_scores):.3f}")

Look for the K value where accuracy peaks. Often you’ll see accuracy rise quickly from K=1, plateau around an optimal range, then gradually decline as K grows too large. If multiple K values perform similarly, choose the smaller one—it’s faster and simpler.
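
The “prefer the smaller K” rule is easy to encode. Here is a small sketch using hypothetical CV scores; the tolerance value is an assumption you would tune for your own problem:

```python
import numpy as np

# Hypothetical cross-validated accuracies for K = 1..10 (illustration only)
k_values = np.arange(1, 11)
cv_scores = np.array([0.90, 0.93, 0.95, 0.96, 0.96, 0.96,
                      0.95, 0.95, 0.94, 0.93])

# Treat any score within a small tolerance of the best as equivalent,
# then take the smallest qualifying K
tolerance = 0.005
candidates = k_values[cv_scores >= cv_scores.max() - tolerance]
chosen_k = candidates.min()
print(f"Best score: {cv_scores.max():.3f}, chosen K: {chosen_k}")
```

Here K=4, 5, and 6 all score 0.96, so the rule picks K=4.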

Grid Search with scikit-learn

GridSearchCV automates the cross-validation approach, handling the iteration and scoring for you. It’s cleaner code and provides additional functionality like refitting on the full dataset.

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_neighbors': list(range(1, 51)),
    'weights': ['uniform', 'distance'],  # Bonus: test weighting schemes
    'metric': ['euclidean', 'manhattan']
}

# Create grid search
knn = KNeighborsClassifier()
grid_search = GridSearchCV(
    knn, 
    param_grid, 
    cv=5, 
    scoring='accuracy',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

# Fit grid search
grid_search.fit(X_train, y_train)

# Extract results
print(f"Best K: {grid_search.best_params_['n_neighbors']}")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Test on holdout set
test_accuracy = grid_search.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

GridSearchCV also lets you tune other hyperparameters simultaneously. Distance weighting (closer neighbors count more) often improves performance, especially with larger K values. Different distance metrics can matter depending on your feature scales and relationships.
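
To see how much weighting matters, you can compare the two schemes directly on the same wine data used earlier; the K of 25 here is deliberately large and purely illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare weighting schemes at a large K, where distance weighting
# tends to matter most: distant neighbors get down-weighted
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=25, weights=weights)
    score = cross_val_score(knn, X_train, y_train, cv=5).mean()
    print(f"weights={weights}: CV accuracy {score:.3f}")
```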

The Elbow Method

The elbow method visualizes the tradeoff between model complexity and performance. Plot both training and validation error against K—the “elbow” is where validation error stops improving significantly.

from sklearn.metrics import accuracy_score

k_range = range(1, 51)
train_scores = []
val_scores = []

# Split training data for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    
    train_scores.append(knn.score(X_tr, y_tr))
    val_scores.append(knn.score(X_val, y_val))

# Plot both curves
plt.figure(figsize=(10, 6))
plt.plot(k_range, train_scores, label='Training Accuracy', marker='o')
plt.plot(k_range, val_scores, label='Validation Accuracy', marker='s')
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('Training vs Validation Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Find elbow: for a rising-then-flattening validation curve, the
# sharpest flattening is where the second derivative is most negative.
# This is a rough heuristic; always sanity-check it against the plot.
val_scores_array = np.array(val_scores)
second_derivative = np.diff(val_scores_array, n=2)
elbow_idx = np.argmin(second_derivative) + 1  # diff(n=2)[i] is centered at i+1
print(f"Elbow point at K={k_range[elbow_idx]}")

Training accuracy starts at 100% for K=1 (every training point is its own nearest neighbor) and generally falls as K increases: more averaging means a less perfect fit to the training data. Validation accuracy rises from K=1, peaks, then declines. The elbow is where the validation curve flattens; beyond this point, increasing K doesn’t help much.

Practical Considerations

Odd vs Even K: For binary classification, always use odd K to avoid ties. With K=4 and two neighbors from each class, which class wins? In a tie, scikit-learn effectively picks the class that comes first in its internal class ordering (the lower label), which introduces arbitrary bias.

# Even K risks ties in binary classification; compare even vs odd
from sklearn.datasets import make_classification

X_binary, y_binary = make_classification(
    n_samples=100, n_features=2, n_redundant=0, 
    n_informative=2, n_clusters_per_class=1, 
    n_classes=2, random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_binary, y_binary, test_size=0.3, random_state=42
)

# Compare even vs odd K
for k in [4, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    print(f"K={k} accuracy: {knn.score(X_te, y_te):.3f}")

Computational Cost: K has less effect on prediction time than you might expect. For brute-force KNN, the dominant cost is computing distances from the query to all n training points, which is the same for any K; selecting and aggregating the K nearest adds only modest overhead. With tree-based indexes (KD-tree, ball tree), larger K does force more of the tree to be searched. Either way, when two K values perform equally, the smaller one is the cheaper choice.
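
A minimal NumPy sketch (synthetic data, not scikit-learn’s implementation) shows where brute-force prediction actually spends its time: the distance computation touches all n training points regardless of K, while picking the K nearest out of those distances is a single O(n) partition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X_train = rng.normal(size=(n, 5))
y_train = (X_train[:, 0] > 0).astype(int)
query = rng.normal(size=5)

# Step 1: distances to ALL n training points -- dominant cost, independent of K
dists = np.linalg.norm(X_train - query, axis=1)

# Step 2: pick the K nearest -- argpartition is O(n) whatever K is
for k in (1, 5, 100):
    nearest = np.argpartition(dists, k)[:k]
    predicted = np.bincount(y_train[nearest]).argmax()
    print(f"K={k}: predicted class {predicted}")
```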

Dataset Size: With small datasets (<1000 samples), use cross-validation extensively. The sqrt rule becomes unreliable. With large datasets (>100k samples), computational constraints matter more—test fewer K values and prefer smaller K.
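
One way to “test fewer K values” on a large dataset is a geometrically spaced grid, dense at small K and sparse at large K. A sketch, with a hypothetical sample count:

```python
import numpy as np

n_samples = 250_000  # hypothetical large training set
upper = int(np.sqrt(n_samples))

# Geometrically spaced candidates: dense at small K, sparse at large K
k_candidates = np.unique(np.geomspace(1, upper, num=10).astype(int))
print(k_candidates)
```

Ten candidates covering 1 to sqrt(n) is far cheaper to cross-validate than an exhaustive sweep, and you can refine around the winner afterwards.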

Dimensionality: In high dimensions, distances concentrate: the nearest and farthest neighbors become almost equally far away (the curse of dimensionality). KNN performance degrades, and K selection matters less because no K works well. Consider dimensionality reduction first.
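
This distance concentration is easy to observe directly; a rough illustration with uniform random data:

```python
import numpy as np

rng = np.random.default_rng(42)
for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(X - query, axis=1)
    # Relative spread between nearest and farthest neighbor
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}: (max - min) / min distance = {contrast:.2f}")
```

The contrast between the nearest and farthest point collapses as dimensionality grows, which is exactly why “nearest” neighbors stop being informative.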

Complete Example with Real Dataset

Here’s an end-to-end workflow combining multiple approaches:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Load and prepare data
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (critical for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Method 1: Square root rule
k_sqrt = int(np.sqrt(len(X_train)))
if k_sqrt % 2 == 0:
    k_sqrt += 1
print(f"Square root rule suggests K={k_sqrt}")

# Method 2: Cross-validation
k_range = range(1, min(51, int(np.sqrt(len(X_train))) * 2))
cv_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, 
                            cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

optimal_k = k_range[np.argmax(cv_scores)]
print(f"Cross-validation suggests K={optimal_k}")
print(f"CV accuracy: {max(cv_scores):.3f}")

# Method 3: Grid search with multiple parameters
param_grid = {
    'n_neighbors': list(k_range),
    'weights': ['uniform', 'distance']
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, 
                   cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_scaled, y_train)
print(f"Grid search suggests K={grid.best_params_['n_neighbors']}")
print(f"Best parameters: {grid.best_params_}")

# Final evaluation
final_model = grid.best_estimator_
test_accuracy = final_model.score(X_test_scaled, y_test)
print(f"\nFinal test accuracy: {test_accuracy:.3f}")

This workflow demonstrates that different methods often converge on similar K values. When they disagree, trust cross-validation over heuristics, but consider the computational tradeoffs of your chosen K.

The right K depends on your specific data, problem complexity, and performance requirements. Start with sqrt(n), validate with cross-validation, and always test on held-out data before deploying your model.
