ELK Stack: Elasticsearch, Logstash, Kibana
Key Insights
- The ELK Stack provides centralized logging with Elasticsearch for storage/search, Logstash for data ingestion/transformation, and Kibana for visualization—together they handle billions of log events daily in production environments
- Elasticsearch’s inverted index architecture enables sub-second searches across terabytes of log data, while its distributed nature provides horizontal scalability and fault tolerance through sharding and replication
- Proper index lifecycle management and retention policies are critical—without them, you’ll quickly consume disk space and degrade cluster performance as indices grow unchecked
Introduction to the ELK Stack
When your application runs on a single server, tailing log files works fine. Scale to dozens of microservices across multiple hosts, and you’ll quickly drown in SSH sessions and grep commands. The ELK Stack solves centralized logging by aggregating logs from all sources into a searchable, analyzable system.
Each component has a distinct role: Elasticsearch stores and indexes log data, making it searchable at scale. Logstash ingests logs from various sources, parses them into structured data, and forwards them to Elasticsearch. Kibana provides a web interface for searching logs, building visualizations, and creating dashboards.
Common use cases extend beyond simple log aggregation. Security teams use ELK for threat detection and incident response. DevOps teams monitor application performance and troubleshoot production issues. Business analysts extract metrics from application logs for business intelligence.
Elasticsearch: The Search and Analytics Engine
Elasticsearch is a distributed search engine built on Apache Lucene. It stores data as JSON documents organized into indices. Think of an index as a database and documents as rows, though this analogy breaks down quickly—Elasticsearch is schema-flexible and optimized for full-text search rather than transactional consistency.
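Concretely, a single log event stored as a document might look like this (the field names here are illustrative, matching the mapping defined later in this section):

```json
{
  "@timestamp": "2024-01-15T10:32:07Z",
  "level": "ERROR",
  "service": "payment-api",
  "message": "Connection timeout to payment gateway",
  "duration_ms": 5003
}
```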
The core concepts you need to understand:
Indices are logical namespaces that hold documents. You typically create time-based indices for logs (e.g., logs-2024-01-15) to simplify retention management.
Sharding splits an index across multiple nodes for horizontal scaling. Each shard is a self-contained Lucene index. More shards mean better parallelization but increased overhead.
Replication creates copies of shards for fault tolerance and read throughput. A replica shard serves read requests and takes over if the primary shard fails.
Here’s how to create an index with explicit mappings:
PUT /application-logs-2024-01
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.refresh_interval": "5s"
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "message": { "type": "text" },
      "service": { "type": "keyword" },
      "duration_ms": { "type": "integer" },
      "user_id": { "type": "keyword" },
      "ip_address": { "type": "ip" }
    }
  }
}
Use the keyword type for exact matches and aggregations, and text for full-text search. The refresh_interval setting controls how quickly new documents become searchable—longer intervals improve indexing throughput at the cost of search freshness.
Searching uses Query DSL, a JSON-based query language:
POST /application-logs-2024-01/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ],
      "filter": [
        { "term": { "service": "payment-api" } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 100
}
Aggregations enable analytics on log data:
POST /application-logs-2024-01/_search
{
  "size": 0,
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service", "size": 10 },
      "aggs": {
        "avg_duration": {
          "avg": { "field": "duration_ms" }
        }
      }
    },
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m"
      }
    }
  }
}
This returns error counts per service with average duration, plus a time-series histogram.
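The bucket results come back nested under an `aggregations` key in the response. A minimal sketch of walking those buckets in Python—using a hand-built sample response in the shape the query above produces, rather than a live cluster:

```python
# Sketch: extracting per-service stats from an Elasticsearch
# aggregation response. The "response" dict is a hand-built sample
# matching the query above, not output from a real cluster.
response = {
    "aggregations": {
        "errors_by_service": {
            "buckets": [
                {"key": "payment-api", "doc_count": 42,
                 "avg_duration": {"value": 187.5}},
                {"key": "auth-api", "doc_count": 7,
                 "avg_duration": {"value": 92.0}},
            ]
        }
    }
}

# Each terms bucket carries its doc_count plus any sub-aggregations.
for bucket in response["aggregations"]["errors_by_service"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} errors, '
          f'avg {bucket["avg_duration"]["value"]} ms')
```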
Logstash: Data Processing Pipeline
Logstash is an ETL pipeline for logs. It follows an input-filter-output architecture: inputs receive data, filters transform it, outputs send it to destinations.
A typical Logstash configuration:
input {
  tcp {
    port => 5000
    codec => json_lines
  }
  file {
    path => "/var/log/nginx/access.log"
    start_position => "beginning"
    type => "nginx"  # tag these events so the filter conditional below matches
  }
}

filter {
  if [type] == "nginx" {
    grok {
      match => {
        "message" => '%{IPORHOST:client_ip} - - \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status_code} %{NUMBER:bytes_sent} "%{DATA:referrer}" "%{DATA:user_agent}"'
      }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
      target => "@timestamp"
    }
    mutate {
      convert => { "status_code" => "integer" }
      convert => { "bytes_sent" => "integer" }
      remove_field => ["message", "timestamp"]
    }
  }
  if [level] == "ERROR" {
    mutate {
      add_tag => ["alert"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    user => "elastic"
    password => "${ELASTIC_PASSWORD}"
  }
  if "alert" in [tags] {
    email {
      to => "ops@company.com"
      subject => "Error Alert: %{service}"
      body => "Error in %{service}: %{message}"
    }
  }
}
The grok filter is crucial—it parses unstructured text into structured fields using regex patterns. Logstash includes patterns for common formats, but you’ll write custom patterns for your application logs.
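Grok patterns compile down to named regular expressions, so you can prototype a custom pattern in plain Python before wiring it into Logstash. A sketch for a hypothetical application log line—the pattern and field names are illustrative, not Logstash's built-ins:

```python
import re

# Hypothetical app log line; the named groups play the role of
# grok's %{PATTERN:field} captures.
pattern = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<service>[\w-]+) took (?P<duration_ms>\d+)ms"
)

line = "2024-01-15 10:32:07 ERROR payment-api took 5003ms"
event = pattern.match(line).groupdict()
print(event)  # structured field dict, analogous to the event grok emits
```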
For application logging, send JSON directly to avoid parsing overhead:
import logging
import logstash  # pip install python-logstash

logger = logging.getLogger('python-logstash')
logger.setLevel(logging.INFO)
logger.addHandler(logstash.TCPLogstashHandler('logstash', 5000, version=1))

# Fields passed via extra become top-level fields in the shipped JSON event
logger.info('Payment processed', extra={
    'user_id': '12345',
    'amount': 99.99,
    'service': 'payment-api'
})
Kibana: Visualization and Exploration
Kibana is your window into Elasticsearch. The Discover interface lets you explore logs with KQL (Kibana Query Language):
level: ERROR and service: "payment-api" and @timestamp >= now-1h
KQL is simpler than Query DSL for interactive searching. It supports wildcards, boolean operators, and range queries.
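A few more KQL examples against the fields mapped earlier (values here are illustrative):

```
service: payment-* and level: (ERROR or WARN)
duration_ms > 1000
not user_agent: *bot*
```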
Create index patterns to tell Kibana which indices to search. Pattern logs-* matches all indices starting with “logs-”. Configure the timestamp field so Kibana can filter by time range.
Dashboards combine multiple visualizations. Here’s a JSON configuration for a simple dashboard panel:
{
  "title": "Error Rate by Service",
  "type": "line",
  "params": {
    "type": "line",
    "grid": { "categoryLines": false },
    "categoryAxes": [{
      "id": "CategoryAxis-1",
      "type": "category",
      "position": "bottom",
      "show": true,
      "title": {}
    }],
    "valueAxes": [{
      "id": "ValueAxis-1",
      "name": "LeftAxis-1",
      "type": "value",
      "position": "left",
      "show": true,
      "title": { "text": "Count" }
    }],
    "seriesParams": [{
      "show": true,
      "type": "line",
      "mode": "normal",
      "data": {
        "label": "Count",
        "id": "1"
      },
      "valueAxis": "ValueAxis-1"
    }]
  }
}
Most teams build dashboards through the UI rather than JSON, but understanding the structure helps with version control and automation.
Setting Up a Complete ELK Stack
Docker Compose provides the fastest path to a working ELK stack:
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=changeme
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    networks:
      - elk

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: logstash
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5000:5000"
      - "9600:9600"
    environment:
      - ELASTIC_PASSWORD=changeme
    depends_on:
      - elasticsearch
    networks:
      - elk

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      # Note: Kibana 8.x rejects the elastic superuser; for real setups,
      # set a password for the built-in kibana_system user and use it here.
      - ELASTICSEARCH_USERNAME=elastic
      - ELASTICSEARCH_PASSWORD=changeme
    depends_on:
      - elasticsearch
    networks:
      - elk

volumes:
  es_data:

networks:
  elk:
    driver: bridge
This configuration uses single-node mode for development. Production deployments need a proper cluster with at least three master-eligible nodes.
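In a production cluster, each node's elasticsearch.yml replaces single-node discovery with seed hosts and an initial master list. A sketch for one node, with placeholder hostnames (es-node-1 through es-node-3 are hypothetical):

```yaml
# elasticsearch.yml — one node of a hypothetical three-node cluster
cluster.name: production-logs
node.name: es-node-1
discovery.seed_hosts: ["es-node-1", "es-node-2", "es-node-3"]
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]
```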
Best Practices and Performance Optimization
Index Lifecycle Management (ILM) automates index rollover and deletion:
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
Apply ILM policies through index templates:
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}
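One easy-to-miss step: rollover requires an initial write index behind the alias named in index.lifecycle.rollover_alias. Bootstrap it once before indexing begins:

```
PUT logs-000001
{
  "aliases": {
    "logs": { "is_write_index": true }
  }
}
```

After this, ILM creates logs-000002, logs-000003, and so on as each rollover condition is met, keeping the logs alias pointed at the current write index.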
Resource sizing: Allocate 50% of available RAM to Elasticsearch heap, up to 32GB maximum. Beyond 32GB, you lose compressed object pointers and waste memory. Run multiple nodes instead.
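For example, a dedicated 16 GB host would get an 8 GB heap, with minimum and maximum set equal to avoid resize pauses:

```
ES_JAVA_OPTS=-Xms8g -Xmx8g
```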
Monitor cluster health:
GET _cluster/health
GET _cat/nodes?v
GET _cat/indices?v&s=store.size:desc
Watch for yellow or red cluster status, high heap usage, and slow indexing rates. These indicate undersized clusters or misconfigured indices.
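These checks are easy to automate. A sketch that flags problems from a _cluster/health response—the response dict below is a hand-built sample; in practice you would fetch it with an HTTP GET against the _cluster/health endpoint:

```python
# Sketch: evaluating cluster health from a _cluster/health response.
# "health" is a hand-built sample, not output from a live cluster.
health = {
    "status": "yellow",
    "number_of_nodes": 1,
    "unassigned_shards": 3,
}

warnings = []
if health["status"] != "green":
    warnings.append(f'cluster status is {health["status"]}')
if health["unassigned_shards"] > 0:
    warnings.append(f'{health["unassigned_shards"]} unassigned shards')

for w in warnings:
    print("WARNING:", w)
```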
The ELK Stack scales from small deployments to petabyte-scale installations. Start simple, monitor performance, and scale components independently as needs grow.