NoSQL Column-Family: Cassandra and HBase
Key Insights
- Column-family databases store data in wide rows with dynamic columns, optimizing for horizontal scalability and high write throughput rather than relational integrity
- Cassandra uses a masterless peer-to-peer architecture with tunable consistency, while HBase follows a master-worker model tightly integrated with Hadoop’s HDFS
- Choose Cassandra for globally distributed writes and multi-datacenter deployments; choose HBase when you need strong consistency and tight Hadoop ecosystem integration
Introduction to Column-Family Databases
Column-family databases represent a fundamental shift from traditional relational models. Instead of organizing data into normalized tables with fixed schemas, they store data in wide rows where each row can have thousands or millions of columns, and different rows can have completely different column sets.
The data model consists of three key components: rows identified by unique keys, column families that group related columns together, and individual columns within those families. This structure excels at storing sparse data where most columns are empty for any given row.
Here’s how the models differ conceptually:
Relational Model:
UserID | Name | Email | LastLogin
1 | Alice | alice@ex.com | 2024-01-15
2 | Bob | bob@ex.com | 2024-01-14
Column-Family Model:
Row Key: user:1
profile:name = "Alice"
profile:email = "alice@ex.com"
activity:lastLogin = "2024-01-15"
activity:pageViews = 150
Row Key: user:2
profile:name = "Bob"
profile:email = "bob@ex.com"
activity:lastLogin = "2024-01-14"
activity:loginCount = 47
Notice how user:2 has different columns than user:1. This flexibility enables schema evolution without migrations and efficient storage of sparse datasets.
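The model above maps naturally onto plain nested dictionaries. The following Python sketch (illustrative only, not a driver API) models row key → column family → column → value, and shows that rows need not share a column set:

```python
# Illustrative sketch of the column-family model: nested maps of
# row key -> column family -> column -> value. Data from the example above.
rows = {
    "user:1": {
        "profile": {"name": "Alice", "email": "alice@ex.com"},
        "activity": {"lastLogin": "2024-01-15", "pageViews": 150},
    },
    "user:2": {
        "profile": {"name": "Bob", "email": "bob@ex.com"},
        "activity": {"lastLogin": "2024-01-14", "loginCount": 47},
    },
}

def columns(row_key):
    """All fully-qualified columns present for one row."""
    return {f"{fam}:{col}"
            for fam, cols in rows[row_key].items()
            for col in cols}

# Columns unique to one row -- sparse, divergent schemas cost nothing.
print(columns("user:1") ^ columns("user:2"))
```

Absent columns are simply not stored, which is why sparse datasets are cheap in this model.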
Apache Cassandra Architecture
Cassandra’s architecture eliminates single points of failure through its masterless, peer-to-peer design. Every node can handle reads and writes, and data is automatically distributed across the cluster using consistent hashing based on partition keys.
The partition key determines which nodes store your data. Cassandra replicates each partition to multiple nodes based on your replication strategy, and you can tune consistency levels per-query to balance availability and consistency.
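Consistent hashing can be sketched in a few lines. The Python below is illustrative only (Cassandra actually uses a Murmur3 partitioner; MD5 and the class names here are stand-ins): each partition key hashes to a position on a ring, and the next N distinct nodes clockwise hold its replicas.

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Hash a key to a ring position (MD5 here; Cassandra uses Murmur3)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Virtual nodes smooth out ownership imbalances across the ring.
        self.ring = sorted((token(f"{n}#{v}"), n)
                           for n in nodes for v in range(vnodes))
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key: str, rf: int):
        """The rf distinct nodes clockwise from the key's token."""
        i = bisect.bisect(self.tokens, token(partition_key)) % len(self.ring)
        owners = []
        while len(owners) < rf:  # assumes rf <= number of distinct nodes
            node = self.ring[i % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners

ring = Ring(["node1", "node2", "node3", "node4"])
print(ring.replicas("sensor:42", rf=3))  # three distinct nodes, deterministic
```

Because placement is a pure function of the key, any node can route a request to the right replicas without consulting a master.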
Creating a keyspace with replication:
CREATE KEYSPACE sensor_data
WITH replication = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3,
'datacenter2': 2
}
AND durable_writes = true;
This creates a keyspace replicated three times in datacenter1 and twice in datacenter2, enabling multi-datacenter deployments with datacenter-aware routing.
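Tunable consistency then comes down to simple arithmetic: a read and a write are guaranteed to overlap on at least one replica whenever R + W > RF. A minimal sketch, with illustrative helper names:

```python
def quorum(rf: int) -> int:
    """Replica count for QUORUM: a strict majority of RF."""
    return rf // 2 + 1

def overlaps(r: int, w: int, rf: int) -> bool:
    """True when a read is guaranteed to see the latest acked write."""
    return r + w > rf

rf = 3
print(quorum(rf))                            # 2
print(overlaps(quorum(rf), quorum(rf), rf))  # QUORUM writes + QUORUM reads: True
print(overlaps(1, 1, rf))                    # ONE/ONE: False (eventual only)
```

This is why QUORUM reads paired with QUORUM writes behave like strong consistency, while ONE/ONE trades that guarantee for latency and availability.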
Defining a table with proper key design:
CREATE TABLE sensor_data.temperature_readings (
sensor_id uuid,
reading_date date,
reading_time timestamp,
temperature decimal,
humidity decimal,
PRIMARY KEY ((sensor_id, reading_date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
The partition key (sensor_id, reading_date) ensures all readings for a sensor on a given date are stored together on the same nodes. The clustering key reading_time orders data within each partition, optimizing time-range queries.
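The effect of this key design can be sketched in plain Python (illustrative only): rows group into partitions by (sensor_id, reading_date), and each partition keeps its readings sorted by time, descending, so "latest N readings" is a cheap prefix read.

```python
from collections import defaultdict

# Sketch of partition + clustering behavior for the table above.
table = defaultdict(list)  # (sensor_id, reading_date) -> [(time, temp), ...]

def insert(sensor_id, reading_date, reading_time, temperature):
    part = (sensor_id, reading_date)          # partition key
    table[part].append((reading_time, temperature))
    # CLUSTERING ORDER BY (reading_time DESC)
    table[part].sort(key=lambda r: r[0], reverse=True)

insert("s1", "2024-01-15", "10:30", 22.5)
insert("s1", "2024-01-15", "10:32", 22.4)
insert("s1", "2024-01-15", "10:31", 22.6)

latest = table[("s1", "2024-01-15")][0]
print(latest)  # ('10:32', 22.4)
```

In the real storage engine the ordering is maintained at write time rather than re-sorted, but the read-side effect is the same.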
Basic operations:
-- Insert data
INSERT INTO sensor_data.temperature_readings
(sensor_id, reading_date, reading_time, temperature, humidity)
VALUES (uuid(), '2024-01-15', toTimestamp(now()), 22.5, 45.0);
-- Query by partition
SELECT * FROM sensor_data.temperature_readings
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
AND reading_date = '2024-01-15'
AND reading_time >= '2024-01-15 00:00:00'
AND reading_time < '2024-01-15 12:00:00';
-- Update (actually an upsert in Cassandra)
UPDATE sensor_data.temperature_readings
SET temperature = 23.0
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
AND reading_date = '2024-01-15'
AND reading_time = '2024-01-15 10:30:00';
Cassandra’s write path is optimized for speed: writes go to a commit log for durability, then to an in-memory structure (memtable), and eventually flush to disk as immutable SSTables. This design enables extremely high write throughput.
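That write path can be sketched as a toy log-structured store (illustrative only; real memtables are sorted structures, SSTables live on disk, and compaction later merges them):

```python
class LSMStore:
    """Toy sketch of commit log -> memtable -> SSTable flush."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only, for durability
        self.memtable = {}     # fast in-memory writes
        self.sstables = []     # immutable flushed segments, oldest first
        self.limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value            # then the in-memory write
        if len(self.memtable) >= self.limit:
            # Flush: the memtable becomes an immutable SSTable.
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:              # newest data wins
            return self.memtable[key]
        for sst in reversed(self.sstables):   # then newest flush first
            if key in sst:
                return sst[key]
        return None
```

Writes never seek or rewrite in place, which is the root of the high write throughput; the cost is paid later, at read and compaction time.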
Apache HBase Architecture
HBase takes a different approach, building on Hadoop’s HDFS for storage and following a master-worker architecture. The HMaster coordinates the cluster, while RegionServers handle actual data operations. HBase automatically splits tables into regions (contiguous row ranges) distributed across RegionServers.
This tight Hadoop integration means HBase excels when you’re already invested in the Hadoop ecosystem and need to run MapReduce jobs directly against your database.
Creating a table in HBase Shell:
create 'sensor_readings',
{NAME => 'data', VERSIONS => 5, COMPRESSION => 'SNAPPY'},
{NAME => 'metadata', VERSIONS => 1}
This creates a table with two column families: ‘data’ keeping five versions with Snappy compression, and ‘metadata’ keeping only the latest version.
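Per-cell versioning can be sketched as a bounded history per (family, qualifier) pair. The Python below is illustrative only and assumes puts arrive in timestamp order:

```python
from collections import defaultdict, deque

class VersionedRow:
    """Toy sketch: each cell keeps at most VERSIONS timestamped values."""

    def __init__(self, versions_by_family):
        self.versions = versions_by_family          # e.g. {"data": 5}
        self.cells = defaultdict(deque)             # (family, qual) -> history

    def put(self, family, qualifier, ts, value):
        cell = self.cells[(family, qualifier)]
        cell.appendleft((ts, value))                # newest first
        while len(cell) > self.versions[family]:
            cell.pop()                              # oldest version falls off

    def get(self, family, qualifier):
        return self.cells[(family, qualifier)][0][1]  # latest value

row = VersionedRow({"data": 5, "metadata": 1})
row.put("metadata", "location", 1, "warehouse-a")
row.put("metadata", "location", 2, "warehouse-b")   # evicts the older version
print(row.get("metadata", "location"))  # warehouse-b
```

This mirrors the table definition above: the 'data' family retains history, while 'metadata' only ever answers with its latest value.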
Java API for data operations:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("sensor_readings"));
// Put data
Put put = new Put(Bytes.toBytes("sensor001:20240115:103000"));
put.addColumn(Bytes.toBytes("data"),
Bytes.toBytes("temperature"),
Bytes.toBytes("22.5"));
put.addColumn(Bytes.toBytes("data"),
Bytes.toBytes("humidity"),
Bytes.toBytes("45.0"));
table.put(put);
// Get data
Get get = new Get(Bytes.toBytes("sensor001:20240115:103000"));
Result result = table.get(get);
byte[] temp = result.getValue(Bytes.toBytes("data"),
Bytes.toBytes("temperature"));
// Scan with filter
// withStartRow/withStopRow replace the deprecated
// setStartRow/setStopRow in HBase 2.x
Scan scan = new Scan()
    .withStartRow(Bytes.toBytes("sensor001:20240115:000000"))
    .withStopRow(Bytes.toBytes("sensor001:20240115:235959"));
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new SingleColumnValueFilter(
Bytes.toBytes("data"),
Bytes.toBytes("temperature"),
CompareOperator.GREATER,
Bytes.toBytes("25.0")
));
scan.setFilter(filters);
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
// Process results
}
scanner.close();
table.close();
connection.close();
Data Modeling Patterns
Both systems require denormalization and designing tables around your query patterns. Rather than normalizing, you duplicate data to avoid joins, which neither system supports.
Time-series pattern in Cassandra:
-- Optimized for recent data queries
CREATE TABLE metrics_by_hour (
metric_name text,
bucket_hour timestamp,
event_time timestamp,
value double,
tags map<text, text>,
PRIMARY KEY ((metric_name, bucket_hour), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND default_time_to_live = 2592000; -- 30 days
The bucketing strategy limits partition size while TTL automatically expires old data.
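The bucketing arithmetic itself is trivial; a sketch with illustrative helper names:

```python
from datetime import datetime

def bucket_hour(ts: datetime) -> datetime:
    """Truncate an event time to its hour bucket."""
    return ts.replace(minute=0, second=0, microsecond=0)

def partition_key(metric: str, ts: datetime):
    """The (metric_name, bucket_hour) pair that picks the partition."""
    return (metric, bucket_hour(ts).isoformat())

t = datetime(2024, 1, 15, 10, 42, 17)
print(partition_key("cpu.load", t))  # ('cpu.load', '2024-01-15T10:00:00')
```

Every event in a given hour lands in the same partition, and no partition can grow beyond one hour of data per metric.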
Wide-row sensor data in HBase:
Row Key: sensor001
data:2024-01-15T10:30:00 = "temp:22.5,humidity:45.0"
data:2024-01-15T10:31:00 = "temp:22.6,humidity:45.2"
data:2024-01-15T10:32:00 = "temp:22.4,humidity:45.1"
... thousands more columns ...
Each column represents a timestamp, enabling efficient range scans without reading irrelevant data.
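Because columns within a row are stored sorted by qualifier, a time window is a contiguous slice. A Python sketch (illustrative only, not the HBase API):

```python
import bisect

# One wide row: column qualifier (a timestamp string) -> value.
row = {
    "2024-01-15T10:30:00": "temp:22.5,humidity:45.0",
    "2024-01-15T10:31:00": "temp:22.6,humidity:45.2",
    "2024-01-15T10:32:00": "temp:22.4,humidity:45.1",
}
quals = sorted(row)  # qualifiers are kept in sorted order on disk

def scan(start, stop):
    """Columns in [start, stop): a binary search plus a contiguous read."""
    lo = bisect.bisect_left(quals, start)
    hi = bisect.bisect_left(quals, stop)
    return {q: row[q] for q in quals[lo:hi]}

print(scan("2024-01-15T10:30:30", "2024-01-15T10:32:00"))  # only the 10:31 reading
```

The scan touches only the slice in range, however many thousands of columns the row holds on either side.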
Denormalization example:
-- Instead of normalized user and orders tables,
-- duplicate user data in orders table
CREATE TABLE orders_by_user (
user_id uuid,
order_date date,
order_id uuid,
user_name text, -- Denormalized
user_email text, -- Denormalized
order_total decimal,
items list<frozen<order_item>>, -- order_item is a user-defined type (CREATE TYPE)
PRIMARY KEY ((user_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id DESC);
Performance Characteristics & Use Cases
Cassandra excels at write-heavy workloads with its log-structured storage and masterless architecture. It’s ideal for:
- Time-series data (IoT sensors, metrics, logs)
- Shopping carts and user sessions
- Message queuing systems
- Any globally distributed application requiring multi-datacenter writes
HBase shines for:
- Large-scale batch processing with MapReduce
- Random real-time read/write access to big data
- Applications already using HDFS
- Scenarios requiring strong consistency
Performance configuration example:
-- Cassandra: SizeTieredCompactionStrategy (the default) suits
-- write-heavy workloads, deferring merge work at the cost of
-- reads touching more SSTables
CREATE TABLE high_write_volume (
id uuid PRIMARY KEY,
data text
) WITH compaction = {
'class': 'SizeTieredCompactionStrategy',
'min_threshold': 4
}
AND compression = {
'class': 'LZ4Compressor'
};
-- For read-heavy workloads, use LeveledCompactionStrategy
-- (e.g. 'sstable_size_in_mb': 160) and increase cache sizes
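The core idea behind size-tiered compaction can be sketched in a few lines (illustrative thresholds, not Cassandra's exact bucketing rules): group SSTables of similar size, and merge a group once it reaches min_threshold members.

```python
def buckets(sstable_sizes_mb, ratio=1.5):
    """Group SSTable sizes into tiers of roughly similar size."""
    groups = []
    for size in sorted(sstable_sizes_mb):
        for g in groups:
            if size <= g[0] * ratio:  # close enough to this tier's smallest
                g.append(size)
                break
        else:
            groups.append([size])     # start a new tier
    return groups

def compaction_candidates(sstable_sizes_mb, min_threshold=4):
    """Tiers with enough members to be worth merging."""
    return [g for g in buckets(sstable_sizes_mb) if len(g) >= min_threshold]

# Four similarly-sized flushes qualify; the two large tables wait.
print(compaction_candidates([10, 11, 12, 13, 100, 110]))  # [[10, 11, 12, 13]]
```

Merging similarly-sized tables keeps each compaction cheap relative to the data it reclaims, which is why this strategy tolerates heavy write loads well.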
Operational Considerations
Cassandra operations:
# Check cluster status
nodetool status
# Repair data inconsistencies
nodetool repair -pr
# Flush memtables to disk
nodetool flush
# Compact SSTables
nodetool compact
# Take snapshot for backup
nodetool snapshot sensor_data
HBase operations:
# Balance regions across servers
hbase shell
> balance_switch true
> balancer
# Compact regions
> major_compact 'sensor_readings'
# Split region manually
> split 'sensor_readings', 'sensor500'
# Backup using snapshots
> snapshot 'sensor_readings', 'backup_20240115'
Cassandra requires monitoring compaction lag, heap usage, and read/write latencies. HBase demands attention to region hotspotting, HDFS health, and ZooKeeper stability.
Conclusion
Cassandra and HBase solve similar problems with different philosophies. Cassandra’s masterless architecture and tunable consistency make it the default choice for internet-scale applications requiring high availability and multi-datacenter deployments. Its CQL interface also provides a gentler learning curve for SQL-familiar teams.
HBase makes sense when you’re already invested in Hadoop, need strong consistency guarantees, or require tight integration with MapReduce and other Hadoop ecosystem tools. Its master-worker architecture provides simpler reasoning about consistency but introduces potential bottlenecks.
The decision framework: Choose Cassandra for write-heavy, globally distributed applications where eventual consistency is acceptable. Choose HBase for Hadoop-centric architectures requiring batch processing capabilities and stronger consistency. Both require careful data modeling and operational expertise, but they’ll scale far beyond what traditional relational databases can handle.