Rust HashSet: Unique Value Collections
Rust's `HashSet<T>` is a collection that stores unique values with no defined order. Under the hood, it's implemented as a `HashMap<T, ()>` where only the keys matter. This gives you O(1)...
Key Insights
- HashSet provides O(1) average-case lookups and automatic deduplication, making it ideal for membership testing and ensuring unique collections—use it when you need fast “does this exist?” checks rather than ordered or indexed access.
- Set operations like union, intersection, and difference are first-class citizens in Rust’s HashSet API, enabling clean and efficient solutions for problems involving overlapping data, permissions, or filtering duplicate entries across collections.
- Custom types require implementing both Hash and Eq traits to work in a HashSet, and the implementation must maintain the invariant that equal values produce equal hashes—violating this causes undefined behavior in lookups and insertions.
Introduction to HashSet
Rust’s HashSet<T> is a collection that stores unique values with no defined order. Under the hood, it’s implemented as a HashMap<T, ()> where only the keys matter. This gives you O(1) average-case performance for insertions, deletions, and lookups—significantly faster than a Vec for membership testing.
Use a HashSet when you need to ensure uniqueness or frequently check whether an item exists in a collection. If you’re iterating through a Vec calling contains() repeatedly, you’re doing O(n) work each time. A HashSet reduces that to O(1) on average.
The tradeoff? HashSets are unordered and require more memory than a simple vector. You also can’t index into them or maintain insertion order (use IndexSet from the indexmap crate if you need that).
use std::collections::HashSet;
fn main() {
// Vec allows duplicates
let mut vec = vec![1, 2, 3, 2, 1];
println!("Vec: {:?}", vec); // Vec: [1, 2, 3, 2, 1]
// HashSet automatically deduplicates
let mut set: HashSet<i32> = vec.into_iter().collect();
println!("HashSet: {:?}", set); // HashSet: {1, 2, 3} (order varies)
// Fast membership testing
println!("Contains 2: {}", set.contains(&2)); // true
println!("Contains 5: {}", set.contains(&5)); // false
}
Creating and Populating HashSets
Rust provides multiple ways to create HashSets depending on your data source. The most basic is HashSet::new(), which creates an empty set with default capacity.
use std::collections::HashSet;
fn main() {
// Empty HashSet
let mut set1: HashSet<i32> = HashSet::new();
// From an array using from()
let set2 = HashSet::from([1, 2, 3, 4]);
// From an iterator using collect()
let set3: HashSet<_> = (1..=5).collect();
// Pre-allocate capacity for better performance
let mut set4 = HashSet::with_capacity(100);
for i in 0..100 {
set4.insert(i);
}
println!("Set4 len: {}, capacity: {}", set4.len(), set4.capacity());
}
Pre-allocating capacity with with_capacity() is important when you know the approximate size of your dataset. This prevents multiple reallocations as the set grows, improving insertion performance.
Here’s a practical example of building a HashSet from file data to track unique IP addresses:
use std::collections::HashSet;
fn parse_unique_ips(log_data: &str) -> HashSet<String> {
log_data
.lines()
.filter_map(|line| {
line.split_whitespace()
.next()
.map(|ip| ip.to_string())
})
.collect()
}
fn main() {
let logs = "192.168.1.1 GET /home
192.168.1.2 POST /api
192.168.1.1 GET /about
192.168.1.3 GET /home";
let unique_ips = parse_unique_ips(logs);
println!("Unique visitors: {}", unique_ips.len()); // 3
}
Core Operations
HashSet provides intuitive methods for manipulating data. The insert() method returns true if the value was newly inserted, false if it already existed. The remove() method returns true if the value was present and removed.
use std::collections::HashSet;
fn main() {
let mut active_sessions: HashSet<String> = HashSet::new();
// Insert returns bool indicating if value was new
let was_new = active_sessions.insert("user123".to_string());
println!("First insert: {}", was_new); // true
let was_new = active_sessions.insert("user123".to_string());
println!("Duplicate insert: {}", was_new); // false
// Check membership
if active_sessions.contains("user123") {
println!("Session active");
}
// Remove returns bool indicating if value existed
let existed = active_sessions.remove("user123");
println!("Removed: {}", existed); // true
// Other useful methods
println!("Length: {}", active_sessions.len());
println!("Is empty: {}", active_sessions.is_empty());
active_sessions.insert("user456".to_string());
active_sessions.clear();
println!("After clear: {}", active_sessions.len()); // 0
}
A practical use case is tracking unique event IDs to prevent duplicate processing:
use std::collections::HashSet;
struct EventProcessor {
processed_ids: HashSet<u64>,
}
impl EventProcessor {
fn new() -> Self {
Self {
processed_ids: HashSet::new(),
}
}
fn process_event(&mut self, event_id: u64) -> bool {
if self.processed_ids.insert(event_id) {
println!("Processing event {}", event_id);
true
} else {
println!("Skipping duplicate event {}", event_id);
false
}
}
}
fn main() {
let mut processor = EventProcessor::new();
processor.process_event(101); // Processing
processor.process_event(102); // Processing
processor.process_event(101); // Skipping duplicate
}
Set Operations
HashSet excels at mathematical set operations. These methods return iterators, so you typically collect them into a new HashSet or Vec.
use std::collections::HashSet;
fn main() {
let set_a: HashSet<_> = [1, 2, 3, 4].into_iter().collect();
let set_b: HashSet<_> = [3, 4, 5, 6].into_iter().collect();
// Union: all elements from both sets
let union: HashSet<_> = set_a.union(&set_b).copied().collect();
println!("Union: {:?}", union); // {1, 2, 3, 4, 5, 6}
// Intersection: elements in both sets
let intersection: HashSet<_> = set_a.intersection(&set_b).copied().collect();
println!("Intersection: {:?}", intersection); // {3, 4}
// Difference: elements in A but not B
let difference: HashSet<_> = set_a.difference(&set_b).copied().collect();
println!("Difference: {:?}", difference); // {1, 2}
// Symmetric difference: elements in either set but not both
let sym_diff: HashSet<_> = set_a.symmetric_difference(&set_b).copied().collect();
println!("Symmetric diff: {:?}", sym_diff); // {1, 2, 5, 6}
// Subset/superset checks
let small_set: HashSet<_> = [1, 2].into_iter().collect();
println!("Is subset: {}", small_set.is_subset(&set_a)); // true
println!("Is superset: {}", set_a.is_superset(&small_set)); // true
}
A real-world scenario: comparing user permissions across roles:
use std::collections::HashSet;
fn main() {
let admin_perms: HashSet<&str> =
["read", "write", "delete", "admin"].into_iter().collect();
let user_perms: HashSet<&str> =
["read", "write"].into_iter().collect();
// What can admins do that users can't?
let admin_only: Vec<_> = admin_perms.difference(&user_perms).collect();
println!("Admin-only permissions: {:?}", admin_only);
// What permissions do they share?
let common: Vec<_> = admin_perms.intersection(&user_perms).collect();
println!("Common permissions: {:?}", common);
}
Iteration and Transformation
HashSets are iterable, but remember: iteration order is not guaranteed and will vary between runs due to hash randomization.
use std::collections::HashSet;
fn main() {
let tags: HashSet<_> = ["rust", "programming", "systems"].into_iter().collect();
// Basic iteration
for tag in &tags {
println!("Tag: {}", tag);
}
// Filter while iterating
let long_tags: Vec<_> = tags.iter()
.filter(|tag| tag.len() > 5)
.collect();
println!("Long tags: {:?}", long_tags);
// Convert to sorted Vec
let mut sorted_tags: Vec<_> = tags.iter().collect();
sorted_tags.sort();
println!("Sorted: {:?}", sorted_tags);
// Drain removes and returns elements
let mut numbers: HashSet<_> = (1..=5).collect();
let drained: Vec<_> = numbers.drain().collect();
println!("Drained: {:?}, remaining: {}", drained, numbers.len());
}
Chaining iterator methods enables powerful data transformations:
use std::collections::HashSet;
fn extract_domains(emails: &[&str]) -> HashSet<String> {
emails.iter()
.filter_map(|email| email.split('@').nth(1))
.map(|domain| domain.to_lowercase())
.collect()
}
fn main() {
let emails = ["alice@EXAMPLE.com", "bob@test.com", "charlie@example.com"];
let domains = extract_domains(&emails);
println!("Unique domains: {:?}", domains); // {example.com, test.com}
}
Custom Types and Hashing
To store custom types in a HashSet, they must implement Hash and Eq. The critical invariant: if two values are equal, they must produce the same hash.
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
#[derive(Debug, Clone)]
struct User {
id: u64,
name: String,
email: String,
}
// Manual implementation to hash only by ID
impl Hash for User {
fn hash<H: Hasher>(&self, state: &mut H) {
self.id.hash(state);
}
}
impl PartialEq for User {
fn eq(&self, other: &Self) -> bool {
self.id == other.id
}
}
impl Eq for User {}
fn main() {
let mut users = HashSet::new();
users.insert(User {
id: 1,
name: "Alice".to_string(),
email: "alice@example.com".to_string(),
});
// Same ID, different data - treated as duplicate
let duplicate = User {
id: 1,
name: "Alice Updated".to_string(),
email: "newemail@example.com".to_string(),
};
println!("Inserted duplicate: {}", users.insert(duplicate)); // false
println!("Total users: {}", users.len()); // 1
}
For simple cases, use derive macros:
use std::collections::HashSet;
#[derive(Debug, Hash, Eq, PartialEq)]
struct Product {
sku: String,
name: String,
}
fn main() {
let mut inventory = HashSet::new();
inventory.insert(Product {
sku: "ABC123".to_string(),
name: "Widget".to_string(),
});
}
Warning: Never use floating-point types in HashSets directly. f32 and f64 don’t implement Hash because NaN != NaN, violating hash requirements. Wrap them in a custom type if needed.
Performance Considerations and Best Practices
HashSet provides O(1) average-case operations, but worst-case is O(n) if many hash collisions occur. For sorted iteration or range queries, use BTreeSet instead (O(log n) operations).
use std::collections::{HashSet, BTreeSet};
use std::time::Instant;
fn benchmark_contains() {
let size = 100_000;
let vec: Vec<i32> = (0..size).collect();
let hashset: HashSet<i32> = (0..size).collect();
let btreeset: BTreeSet<i32> = (0..size).collect();
let search_value = size - 1;
// Vec: O(n)
let start = Instant::now();
vec.contains(&search_value);
println!("Vec contains: {:?}", start.elapsed());
// HashSet: O(1) average
let start = Instant::now();
hashset.contains(&search_value);
println!("HashSet contains: {:?}", start.elapsed());
// BTreeSet: O(log n)
let start = Instant::now();
btreeset.contains(&search_value);
println!("BTreeSet contains: {:?}", start.elapsed());
}
Pre-allocate capacity when you know the dataset size:
use std::collections::HashSet;
fn load_large_dataset(items: Vec<String>) -> HashSet<String> {
let mut set = HashSet::with_capacity(items.len());
for item in items {
set.insert(item);
}
set
}
Choose the right collection for your use case: HashSet for fast lookups and uniqueness, Vec for ordered or indexed access, and BTreeSet for sorted iteration or range queries. Understanding these tradeoffs will make your Rust code both correct and performant.