How to Find Unique Values in NumPy
Key Insights
- np.unique() is the workhorse function for finding unique values in NumPy, but its optional parameters (return_counts, return_index, return_inverse) unlock powerful functionality that most developers underuse.
- When working with multi-dimensional arrays, the axis parameter lets you find unique rows or columns, which is essential for data deduplication tasks.
- NumPy's unique function always sorts its results, and its handling of np.nan has changed across versions: behaviors that can surprise you if you're coming from pandas or pure Python sets.
Introduction
Finding unique values is one of those operations you’ll perform constantly in data analysis. Whether you’re cleaning datasets, encoding categorical variables, or simply exploring what values exist in your data, you need a reliable and fast way to extract distinct elements.
NumPy provides np.unique() as its primary tool for this task. While the basic usage is straightforward, the function has several parameters that transform it from a simple deduplication tool into a powerful utility for counting, indexing, and reconstructing arrays. Understanding these capabilities will save you from writing unnecessary loops and make your code both cleaner and faster.
Basic Usage of np.unique()
The simplest form of np.unique() takes an array and returns a sorted array of unique values:
import numpy as np
# Simple integer array with duplicates
data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
unique_values = np.unique(data)
print(unique_values)
# Output: [1 2 3 4 5 6 9]
Notice two things immediately: the result is sorted in ascending order, and the original order of first appearance is not preserved. This sorting behavior is intentional—NumPy uses sorting as part of its algorithm to find unique values efficiently.
The function works with any data type NumPy supports:
# Works with strings
names = np.array(['alice', 'bob', 'alice', 'charlie', 'bob'])
print(np.unique(names))
# Output: ['alice' 'bob' 'charlie']
# Works with floats
measurements = np.array([1.5, 2.3, 1.5, 4.7, 2.3])
print(np.unique(measurements))
# Output: [1.5 2.3 4.7]
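The same applies to dates and times; as a quick sketch, a datetime64 array deduplicates just like numbers and strings:

```python
import numpy as np

# Duplicated calendar dates
dates = np.array(['2024-01-01', '2024-01-02', '2024-01-01'], dtype='datetime64[D]')
print(np.unique(dates))
# Output: ['2024-01-01' '2024-01-02']
```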
Getting Counts and Indices
The real power of np.unique() comes from its optional return parameters. These let you extract additional information in a single pass through the data.
Counting Occurrences
The return_counts parameter gives you the frequency of each unique value:
data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
unique_values, counts = np.unique(data, return_counts=True)
print("Value | Count")
print("-" * 13)
for val, count in zip(unique_values, counts):
    print(f"  {val}   |   {count}")
# Output:
# Value | Count
# -------------
# 1 | 2
# 2 | 1
# 3 | 2
# 4 | 1
# 5 | 3
# 6 | 1
# 9 | 1
This is far more efficient than counting each value separately, for example by calling np.count_nonzero(data == val) in a loop or a list comprehension.
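As one illustration, the counts pair naturally with np.argmax to find the most frequent value, and with zip to build a frequency table (a sketch reusing the data array from above):

```python
import numpy as np

data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
unique_values, counts = np.unique(data, return_counts=True)

# The mode: the unique value with the largest count
mode = unique_values[np.argmax(counts)]
print(mode)
# Output: 5

# A plain-Python frequency dictionary
freq = dict(zip(unique_values.tolist(), counts.tolist()))
print(freq)
# Output: {1: 2, 2: 1, 3: 2, 4: 1, 5: 3, 6: 1, 9: 1}
```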
Finding Original Indices
The return_index parameter returns the indices of the first occurrence of each unique value in the original array:
data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
unique_values, first_indices = np.unique(data, return_index=True)
print(f"Unique values: {unique_values}")
print(f"First occurrence indices: {first_indices}")
# Output:
# Unique values: [1 2 3 4 5 6 9]
# First occurrence indices: [1 6 0 2 4 7 5]
# Verify: data[1] = 1, data[6] = 2, data[0] = 3, etc.
Reconstructing the Original Array
The return_inverse parameter is perhaps the most useful for machine learning workflows. It returns indices that let you reconstruct the original array from the unique values:
data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
unique_values, inverse_indices = np.unique(data, return_inverse=True)
print(f"Unique values: {unique_values}")
print(f"Inverse indices: {inverse_indices}")
# Output:
# Unique values: [1 2 3 4 5 6 9]
# Inverse indices: [2 0 3 0 4 6 1 5 4 2 4]
# Reconstruct original array
reconstructed = unique_values[inverse_indices]
print(f"Reconstructed: {reconstructed}")
print(f"Original: {data}")
print(f"Match: {np.array_equal(reconstructed, data)}")
# Output: Match: True
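The inverse indices also make loop-free, group-wise aggregation possible. As a sketch, suppose each element of data has an associated weight (the weights array here is made up for illustration); np.bincount can then sum the weights per unique value:

```python
import numpy as np

data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
weights = np.array([1.0, 2.0, 1.0, 3.0, 0.5, 1.0, 2.0, 1.0, 0.5, 2.0, 1.0])

unique_values, inverse = np.unique(data, return_inverse=True)

# bincount adds weights[i] into bin inverse[i]: one total per unique value
totals = np.bincount(inverse, weights=weights)
for val, total in zip(unique_values, totals):
    print(f"{val}: {total}")
```

This pattern generalizes to means (divide by the counts), maxima (via sorting), and similar per-group reductions.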
You can combine all three parameters in a single call:
unique_values, indices, inverse, counts = np.unique(
data,
return_index=True,
return_inverse=True,
return_counts=True
)
Working with Multi-Dimensional Arrays
By default, np.unique() flattens multi-dimensional arrays before finding unique values. The axis parameter changes this behavior:
# 2D array with duplicate rows
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[1, 2, 3], # duplicate of row 0
[7, 8, 9],
[4, 5, 6], # duplicate of row 1
])
# Find unique rows (axis=0)
unique_rows = np.unique(matrix, axis=0)
print("Unique rows:")
print(unique_rows)
# Output:
# [[1 2 3]
# [4 5 6]
# [7 8 9]]
You can also find unique columns with axis=1:
# Array with duplicate columns
data = np.array([
[1, 2, 1, 3],
[4, 5, 4, 6],
[7, 8, 7, 9],
])
unique_cols = np.unique(data, axis=1)
print("Unique columns:")
print(unique_cols)
# Output:
# [[1 2 3]
# [4 5 6]
# [7 8 9]]
The return parameters work with axis as well, which is useful for tracking which rows or columns were duplicates:
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[1, 2, 3],
[7, 8, 9],
])
unique_rows, indices, inverse, counts = np.unique(
matrix, axis=0,
return_index=True,
return_inverse=True,
return_counts=True
)
print(f"Row [1, 2, 3] appears {counts[0]} times")
# Output: Row [1, 2, 3] appears 2 times
Handling Special Cases
NaN Values
NumPy's treatment of NaN in np.unique() is version-dependent, and it differs from pandas either way:

data = np.array([1.0, np.nan, 2.0, np.nan, 1.0])
unique_values = np.unique(data)
print(unique_values)
# NumPy >= 1.21: [ 1.  2. nan]
# NumPy < 1.21:  [ 1.  2. nan nan]

Before NumPy 1.21, every NaN appeared separately in the result because np.nan != np.nan evaluates to True. Since NumPy 1.21, all NaNs are collapsed into a single entry at the end of the sorted output. If you are on an older version, or simply want explicit control, filter NaN out first:
data = np.array([1.0, np.nan, 2.0, np.nan, 1.0])
# Remove NaN, find unique, then add NaN back if needed
clean_data = data[~np.isnan(data)]
unique_values = np.unique(clean_data)
if np.any(np.isnan(data)):
    unique_values = np.append(unique_values, np.nan)
print(unique_values)
# Output: [1. 2. nan]
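If you're on NumPy 1.24 or later, the equal_nan parameter gives you this control directly, without any manual filtering (a sketch that assumes a sufficiently recent NumPy):

```python
import numpy as np

data = np.array([1.0, np.nan, 2.0, np.nan, 1.0])

# Collapse all NaNs into a single entry (the default)
print(np.unique(data, equal_nan=True))
# Output: [ 1.  2. nan]

# Keep every NaN distinct (the pre-1.21 behavior)
print(np.unique(data, equal_nan=False))
# Output: [ 1.  2. nan nan]
```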
Sorting Behavior
Sorting is built into np.unique(). If you need to preserve the order of first appearance, use return_index and sort by those indices:
data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
unique_values, first_indices = np.unique(data, return_index=True)
# Sort by first appearance
order = np.argsort(first_indices)
unique_ordered = unique_values[order]
print(f"Sorted unique: {unique_values}")
print(f"Order of appearance: {unique_ordered}")
# Output:
# Sorted unique: [1 2 3 4 5 6 9]
# Order of appearance: [3 1 4 5 9 2 6]
Performance Considerations
For pure NumPy arrays, np.unique() is typically your best choice. But alternatives exist, and the right choice depends on your data:
import time
# Generate large array with many duplicates
np.random.seed(42)
large_array = np.random.randint(0, 1000, size=10_000_000)
# NumPy unique
start = time.perf_counter()
np_result = np.unique(large_array)
np_time = time.perf_counter() - start
# Python set (requires conversion)
start = time.perf_counter()
set_result = np.array(sorted(set(large_array)))
set_time = time.perf_counter() - start
# Pandas unique (if available)
import pandas as pd
start = time.perf_counter()
pd_result = pd.unique(large_array)
pd_time = time.perf_counter() - start
print(f"NumPy unique: {np_time:.4f}s")
print(f"Python set: {set_time:.4f}s")
print(f"Pandas unique: {pd_time:.4f}s")
# Typical output:
# NumPy unique: 0.4521s
# Python set: 1.2834s
# Pandas unique: 0.1247s
Pandas is often faster because it doesn’t sort the results. If you don’t need sorted output and are already using pandas, pd.unique() is a solid choice. For pure NumPy workflows, stick with np.unique().
Practical Applications
Preprocessing Categorical Data for Machine Learning
The return_inverse parameter provides a clean way to encode categorical variables:
# Raw categorical data
categories = np.array(['red', 'blue', 'green', 'blue', 'red', 'green', 'red'])
# Encode to integers
unique_categories, encoded = np.unique(categories, return_inverse=True)
print(f"Categories: {unique_categories}")
print(f"Encoded: {encoded}")
# Output:
# Categories: ['blue' 'green' 'red']
# Encoded: [2 0 1 0 2 1 2]
# Create a decoder dictionary
decoder = {i: cat for i, cat in enumerate(unique_categories)}
print(f"Decoder: {decoder}")
# Output: {0: 'blue', 1: 'green', 2: 'red'}
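Since encoded already indexes into unique_categories, decoding doesn't strictly require a dictionary; fancy indexing reverses the encoding in one step:

```python
import numpy as np

categories = np.array(['red', 'blue', 'green', 'blue', 'red', 'green', 'red'])
unique_categories, encoded = np.unique(categories, return_inverse=True)

# Decode every label at once by indexing the unique array
decoded = unique_categories[encoded]
print(np.array_equal(decoded, categories))
# Output: True
```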
Data Deduplication with Tracking
When cleaning datasets, you often need to know which records were duplicates:
# Simulated user activity logs (user_id, action_id, timestamp_bucket)
logs = np.array([
[101, 1, 1000],
[102, 2, 1000],
[101, 1, 1000], # duplicate
[103, 1, 1001],
[102, 2, 1000], # duplicate
])
unique_logs, indices, counts = np.unique(
logs, axis=0,
return_index=True,
return_counts=True
)
print(f"Original records: {len(logs)}")
print(f"Unique records: {len(unique_logs)}")
print(f"Duplicate counts: {counts[counts > 1]}")
# Output:
# Original records: 5
# Unique records: 3
# Duplicate counts: [2 2]
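Going a step further, return_inverse lets you flag every duplicated record in place rather than just counting the duplicates; here is a sketch on the same logs array:

```python
import numpy as np

logs = np.array([
    [101, 1, 1000],
    [102, 2, 1000],
    [101, 1, 1000],  # duplicate
    [103, 1, 1001],
    [102, 2, 1000],  # duplicate
])

unique_logs, inverse, counts = np.unique(
    logs, axis=0, return_inverse=True, return_counts=True
)

# A row belongs to a duplicated group if its group's count exceeds 1
# (ravel() keeps the indexing robust to inverse-shape differences across versions)
is_duplicated = counts[inverse.ravel()] > 1
print(is_duplicated)
# Output: [ True  True  True False  True]
```

The resulting boolean mask can feed directly into further filtering, e.g. logs[~is_duplicated] combined with a keep-first rule.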
The combination of np.unique() with its return parameters handles the vast majority of unique-value operations you’ll encounter. Master these patterns, and you’ll write cleaner, faster NumPy code.