DevOps · April 13, 2026 · 12 min read

Elasticsearch Operations Guide: Cluster Sizing, ILM, and Performance Tuning


Introduction

Elasticsearch is one of those tools that is easy to set up but notoriously difficult to operate at scale. A basic single-node cluster handles your first million documents fine, then things start breaking: slow queries, disk pressure alerts, unbalanced shards, and the dreaded cluster yellow or red status at 3 AM.

This guide covers the operational side of Elasticsearch - the things you learn after running it in production for years. We will go through cluster sizing, index lifecycle management (ILM), shard allocation strategies, snapshot and restore procedures, and performance tuning techniques that actually matter.

Cluster Sizing and Hardware Planning

The most common mistake in Elasticsearch sizing is starting with too few nodes and too little memory. Here are the numbers that matter:

Memory: Elasticsearch lives and dies by its JVM heap and filesystem cache. The rule of thumb: give Elasticsearch 50% of available RAM as JVM heap (capped at 31GB to stay under compressed oops threshold), and leave the other 50% for the OS filesystem cache.

# jvm.options
-Xms16g
-Xmx16g
# Never set heap above 31g - you lose compressed ordinary object pointers
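The 50%-of-RAM rule with the 31GB cap is easy to get wrong during provisioning, so it is worth scripting. A quick sketch (the exact compressed-oops ceiling varies slightly by JVM build, so 31GB is a safe approximation):

```python
def recommended_heap_gb(total_ram_gb: int) -> int:
    """Give the JVM half of RAM, capped at 31 GB to keep compressed oops.

    The remaining RAM is deliberately left to the OS filesystem cache.
    """
    return min(total_ram_gb // 2, 31)

# A 32 GB node gets a 16 GB heap; a 64 GB node is capped at 31 GB.
```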

For a production logging cluster handling ~100GB/day of logs:

| Role            | Nodes | CPU    | RAM   | Storage        |
|-----------------|-------|--------|-------|----------------|
| Master-eligible | 3     | 2 vCPU | 8 GB  | 50 GB SSD      |
| Data (hot)      | 3     | 8 vCPU | 64 GB | 2 TB NVMe SSD  |
| Data (warm)     | 2     | 4 vCPU | 32 GB | 8 TB HDD       |
| Coordinating    | 2     | 4 vCPU | 16 GB | 100 GB SSD     |

Dedicated master nodes are non-negotiable for any production cluster. Master nodes handle cluster state, shard allocation, and index creation. If your master nodes are also serving queries and indexing data, cluster instability follows.

# elasticsearch.yml - dedicated master node
node.roles: [master]
cluster.name: production-logs
node.name: master-1

# elasticsearch.yml - hot data node
node.roles: [data_hot, data_content, ingest]
node.attr.data: hot   # custom attribute, matched by allocation filters and ILM allocate actions
cluster.name: production-logs
node.name: hot-data-1

AWS-specific sizing: On EC2, use r6g instances for data nodes (memory-optimized, Graviton for cost savings) and m6g for master/coordinating nodes. Use gp3 EBS volumes - they offer 3000 baseline IOPS regardless of volume size, which is more cost-effective than gp2 for most Elasticsearch workloads.


Index Lifecycle Management (ILM)

ILM automates the progression of indices through hot, warm, cold, and delete phases. Without ILM, you end up with hundreds of indices eating resources forever, or someone writes a fragile cron job to delete old indices.

Here is a production ILM policy for application logs:

PUT _ilm/policy/logs-lifecycle
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Apply the policy to an index template:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-lifecycle",
      "index.lifecycle.rollover_alias": "logs-write",
      "index.routing.allocation.require.data": "hot"
    }
  }
}
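One step the template does not cover: rollover needs an initial index behind the write alias before ILM can take over. Bootstrap it once, with is_write_index set:

```
PUT logs-000001
{
  "aliases": {
    "logs-write": {
      "is_write_index": true
    }
  }
}
```

After this, indexing goes through the logs-write alias, and ILM creates logs-000002 and beyond on rollover.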

Key ILM tuning tips:

  • Rollover on shard size (50GB max primary shard) rather than document count - shard size is what affects performance
  • Force merge to 1 segment in the warm phase to reclaim disk and improve query performance on read-only indices
  • Shrink indices in warm phase if you started with multiple shards for write throughput but no longer need them
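To confirm indices are actually moving through the phases, the ILM explain API reports each index's current phase, action, and any errors:

```
GET logs-*/_ilm/explain
```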

Shard Allocation and Rebalancing

Sharding is where most Elasticsearch performance problems originate. The guidelines are straightforward but frequently violated:

Rule 1: Keep shards between 10GB and 50GB. Shards under 1GB waste resources on overhead. Shards over 50GB make recovery slow and rebalancing painful.

Rule 2: Aim for fewer than 20 shards per GB of heap. A node with a 16GB heap should host no more than ~320 shards. Each shard consumes heap for metadata, field data, and segment info regardless of whether it is being queried.
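The shard budget is worth baking into capacity reviews. A rough sketch of the guideline (20 per GB is a ceiling, not a target):

```python
def shard_budget(heap_gb: int, shards_per_gb: int = 20) -> int:
    """Upper bound on shards a node should host: ~20 per GB of JVM heap."""
    return heap_gb * shards_per_gb

# A 16 GB-heap data node should stay under 320 shards.
```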

Check your current shard distribution:

# Shard count and disk usage per node
curl -s "localhost:9200/_cat/allocation?v"

# Shards sorted by size
curl -s "localhost:9200/_cat/shards?v&s=store:desc"

# Identify hot spots - nodes with too many shards
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,disk.used_percent,shards"

Spreading shards across failure domains does not happen automatically - configure allocation awareness so that copies of the same shard never land in the same availability zone:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "us-east-1a,us-east-1b,us-east-1c"
  }
}

With forced awareness, Elasticsearch will not place a primary and its replica in the same availability zone (each node must set node.attr.zone accordingly), giving you zone-failure resilience.

Snapshot and Restore

Snapshots are your safety net. Without them, a bad mapping change, accidental index deletion, or cluster corruption means data loss. Set up automated snapshots to S3:

PUT _snapshot/s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-snapshots",
    "region": "us-east-1",
    "base_path": "production",
    "max_snapshot_bytes_per_sec": "200mb",
    "max_restore_bytes_per_sec": "200mb"
  }
}

Create a snapshot lifecycle management (SLM) policy:

PUT _slm/policy/nightly-backup
{
  "schedule": "0 0 2 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "s3-backup",
  "config": {
    "indices": ["logs-*", "metrics-*", "app-*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 7,
    "max_count": 30
  }
}
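Do not wait for 2 AM to learn the policy is misconfigured - trigger it once by hand and inspect the result:

```
POST _slm/policy/nightly-backup/_execute

GET _slm/policy/nightly-backup
```

The GET response reports the policy's last success and last failure, which are also worth alerting on.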

Test your restores regularly. A snapshot you have never restored is a backup you do not have. Schedule a quarterly restore drill:

# Restore a specific index to verify backup integrity
curl -X POST "localhost:9200/_snapshot/s3-backup/nightly-snap-2026.04.01/_restore" \
  -H 'Content-Type: application/json' -d '{
    "indices": "logs-2026.03.31",
    "rename_pattern": "(.+)",
    "rename_replacement": "restored-$1"
  }'

# Verify document count matches
curl -s "localhost:9200/restored-logs-2026.03.31/_count"
curl -s "localhost:9200/logs-2026.03.31/_count"

Performance Tuning

Here are the performance levers that have the biggest real-world impact, ranked by effort-to-benefit ratio:

1. Use bulk indexing, always. Single-document indexing is orders of magnitude slower. Aim for bulk requests of 5-15MB.

# Bad: indexing one document at a time
curl -X POST "localhost:9200/logs/_doc" -H 'Content-Type: application/json' -d '{"message": "log entry"}'

# Good: bulk indexing
curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @bulk-data.ndjson
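Generating the NDJSON body is where bulk indexing usually goes wrong: every document needs an action line, and the body must end with a newline. A minimal sketch in Python (index name and documents are illustrative):

```python
import json

def build_bulk_body(index: str, docs: list) -> str:
    """Build an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

body = build_bulk_body("logs", [{"message": "log entry"}, {"message": "another"}])
```

In practice you would also chunk the document list so each request body stays in the 5-15MB range.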

2. Tune refresh interval for write-heavy indices. The default 1-second refresh creates a new Lucene segment every second, which is expensive. For logging use cases where near-real-time search is not critical:

PUT logs-write/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
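For one-off bulk loads or reindexing you can go further: disable refresh entirely for the duration, then restore it when the load finishes:

```
PUT logs-write/_settings
{ "index": { "refresh_interval": "-1" } }

# ... run the bulk load ...

PUT logs-write/_settings
{ "index": { "refresh_interval": "30s" } }
```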

3. Use doc_values and avoid fielddata. For aggregations and sorting on text fields, use the keyword type or multi-field mapping. Never enable fielddata on analyzed text fields - it loads the entire inverted index into heap memory.
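In mapping terms, that usually means a text field with a keyword sub-field: full-text search hits the analyzed field, while sorting and aggregations hit the sub-field's doc_values. Field names here are illustrative:

```
PUT logs-*/_mapping
{
  "properties": {
    "message": {
      "type": "text",
      "fields": {
        "raw": { "type": "keyword", "ignore_above": 256 }
      }
    }
  }
}
```

Aggregate and sort on message.raw; search full-text on message.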

4. Profile slow queries. Enable slow log to catch queries that degrade cluster performance:

PUT logs-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

5. Disable swapping entirely. Swapped JVM heap is a death sentence for Elasticsearch performance. Configure this at the OS level:

# /etc/sysctl.conf
vm.swappiness = 1

# elasticsearch.yml
bootstrap.memory_lock: true

Monitoring and Alerting

At minimum, monitor these metrics (Prometheus with elasticsearch_exporter, Datadog, or Elastic's own monitoring):

  • Cluster health - yellow means missing replicas, red means missing primaries
  • JVM heap pressure - alert at 75%, panic at 85%
  • Disk watermark - allocation of new shards stops at 85% (low), shards relocate away at 90% (high), and indices get a read-only block at 95% (flood stage)
  • Thread pool rejections - search and write thread pool rejections indicate saturation
  • GC pause time - young GC pauses over 100ms or old GC pauses over 1s are problems

# Quick health check script
curl -s localhost:9200/_cluster/health | jq '{
  status,
  number_of_nodes,
  active_primary_shards,
  relocating_shards,
  unassigned_shards
}'

# Thread pool rejections
curl -s "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,rejected,completed"
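If you wire these thresholds into your own alerting, the decision logic is small enough to sketch directly (the health dict mirrors the _cluster/health response; the heap percentage would come from _nodes/stats - treat the wiring as an assumption):

```python
def alert_level(health: dict, heap_percent: int) -> str:
    """Map cluster health status and JVM heap pressure to a severity."""
    if health.get("status") == "red" or heap_percent >= 85:
        return "critical"
    if health.get("status") == "yellow" or heap_percent >= 75:
        return "warning"
    if health.get("unassigned_shards", 0) > 0:
        return "warning"
    return "ok"
```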

Need Help with Your DevOps?

Running Elasticsearch in production requires ongoing attention to cluster health, capacity planning, and performance optimization. At InstaDevOps, we help startups and SMBs manage complex infrastructure like Elasticsearch clusters without the cost of a full-time hire - starting at $2,999/mo.

Book a free 15-minute consultation to discuss your Elasticsearch challenges.
