DevOps · April 13, 2026 · 12 min read

Elasticsearch Operations Guide: Cluster Sizing, ILM, and Performance Tuning


Introduction

Elasticsearch is one of those tools that is easy to set up but notoriously difficult to operate at scale. A basic single-node cluster handles your first million documents fine, then things start breaking: slow queries, disk pressure alerts, unbalanced shards, and the dreaded cluster yellow or red status at 3 AM.

This guide covers the operational side of Elasticsearch - the things you learn after running it in production for years. We will go through cluster sizing, index lifecycle management (ILM), shard allocation strategies, snapshot and restore procedures, and performance tuning techniques that actually matter.

Cluster Sizing and Hardware Planning

The most common mistake in Elasticsearch sizing is starting with too few nodes and too little memory. Here are the numbers that matter:

Memory: Elasticsearch lives and dies by its JVM heap and filesystem cache. The rule of thumb: give Elasticsearch 50% of available RAM as JVM heap (capped at 31GB to stay under compressed oops threshold), and leave the other 50% for the OS filesystem cache.

# jvm.options
-Xms16g
-Xmx16g
# Never set heap above 31g - you lose compressed ordinary object pointers
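The 50%-of-RAM rule with the 31GB cap is easy to get wrong during provisioning, so it is worth scripting. A quick sketch (the exact compressed-oops ceiling varies slightly by JVM build, so 31GB is a safe approximation):

```python
def recommended_heap_gb(total_ram_gb: int) -> int:
    """Give the JVM half of RAM, capped at 31 GB to keep compressed oops.

    The remaining RAM is deliberately left to the OS filesystem cache.
    """
    return min(total_ram_gb // 2, 31)

# A 32 GB node gets a 16 GB heap; a 64 GB node is capped at 31 GB.
```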

For a production logging cluster handling ~100GB/day of logs:

| Role            | Nodes | CPU    | RAM   | Storage        |
|-----------------|-------|--------|-------|----------------|
| Master-eligible | 3     | 2 vCPU | 8 GB  | 50 GB SSD      |
| Data (hot)      | 3     | 8 vCPU | 64 GB | 2 TB NVMe SSD  |
| Data (warm)     | 2     | 4 vCPU | 32 GB | 8 TB HDD       |
| Coordinating    | 2     | 4 vCPU | 16 GB | 100 GB SSD     |

Dedicated master nodes are non-negotiable for any production cluster. Master nodes handle cluster state, shard allocation, and index creation. If your master nodes are also serving queries and indexing data, cluster instability follows.

# elasticsearch.yml - dedicated master node
node.roles: [master]
cluster.name: production-logs
node.name: master-1

# elasticsearch.yml - hot data node
node.roles: [data_hot, data_content, ingest]
node.attr.data: hot   # custom attribute, matched by allocation filters and ILM allocate actions
cluster.name: production-logs
node.name: hot-data-1

AWS-specific sizing: On EC2, use r6g instances for data nodes (memory-optimized, Graviton for cost savings) and m6g for master/coordinating nodes. Use gp3 EBS volumes - they offer 3000 baseline IOPS regardless of volume size, which is more cost-effective than gp2 for most Elasticsearch workloads.


Index Lifecycle Management (ILM)

ILM automates the progression of indices through hot, warm, cold, and delete phases. Without ILM, you end up with hundreds of indices eating resources forever, or someone writes a fragile cron job to delete old indices.

Here is a production ILM policy for application logs:

PUT _ilm/policy/logs-lifecycle
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Apply the policy to an index template:

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-lifecycle",
      "index.lifecycle.rollover_alias": "logs-write",
      "index.routing.allocation.require.data": "hot"
    }
  }
}
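One step the template does not cover: rollover needs an initial index behind the write alias before ILM can take over. Bootstrap it once, with is_write_index set:

```
PUT logs-000001
{
  "aliases": {
    "logs-write": {
      "is_write_index": true
    }
  }
}
```

After this, indexing goes through the logs-write alias, and ILM creates logs-000002 and beyond on rollover.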

Key ILM tuning tips:

  • Rollover on shard size (50GB max primary shard) rather than document count - shard size is what affects performance
  • Force merge to 1 segment in the warm phase to reclaim disk and improve query performance on read-only indices
  • Shrink indices in warm phase if you started with multiple shards for write throughput but no longer need them
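To confirm indices are actually moving through the phases, the ILM explain API reports each index's current phase, action, and any errors:

```
GET logs-*/_ilm/explain
```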

Shard Allocation and Rebalancing

Sharding is where most Elasticsearch performance problems originate. The guidelines are straightforward but frequently violated:

Rule 1: Keep shards between 10GB and 50GB. Shards under 1GB waste resources on overhead. Shards over 50GB make recovery slow and rebalancing painful.

Rule 2: Aim for fewer than 20 shards per GB of heap. A node with a 16GB heap should host no more than ~320 shards. Each shard consumes heap for metadata, field data, and segment info regardless of whether it is being queried.
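The shard budget is worth baking into capacity reviews. A rough sketch of the guideline (20 per GB is a ceiling, not a target):

```python
def shard_budget(heap_gb: int, shards_per_gb: int = 20) -> int:
    """Upper bound on shards a node should host: ~20 per GB of JVM heap."""
    return heap_gb * shards_per_gb

# A 16 GB-heap data node should stay under 320 shards.
```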

Check your current shard distribution:

# Shard count and disk usage per node
curl -s "localhost:9200/_cat/allocation?v"

# Shards sorted by size
curl -s "localhost:9200/_cat/shards?v&s=store:desc"

# Identify hot spots - nodes with too many shards
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,disk.used_percent,shards"

Spreading shards across failure domains does not happen automatically - configure allocation awareness so that copies of the same shard never land in the same availability zone:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "us-east-1a,us-east-1b,us-east-1c"
  }
}

With forced awareness, Elasticsearch will not place a primary and its replica in the same availability zone (each node must set node.attr.zone accordingly), giving you zone-failure resilience.

Snapshot and Restore

Snapshots are your safety net. Without them, a bad mapping change, accidental index deletion, or cluster corruption means data loss. Set up automated snapshots to S3:

PUT _snapshot/s3-backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-snapshots",
    "region": "us-east-1",
    "base_path": "production",
    "max_snapshot_bytes_per_sec": "200mb",
    "max_restore_bytes_per_sec": "200mb"
  }
}

Create a snapshot lifecycle management (SLM) policy:

PUT _slm/policy/nightly-backup
{
  "schedule": "0 0 2 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "s3-backup",
  "config": {
    "indices": ["logs-*", "metrics-*", "app-*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 7,
    "max_count": 30
  }
}
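Do not wait for 2 AM to learn the policy is misconfigured - trigger it once by hand and inspect the result:

```
POST _slm/policy/nightly-backup/_execute

GET _slm/policy/nightly-backup
```

The GET response reports the policy's last success and last failure, which are also worth alerting on.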

Test your restores regularly. A snapshot you have never restored is a backup you do not have. Schedule a quarterly restore drill:

# Restore a specific index to verify backup integrity
curl -X POST "localhost:9200/_snapshot/s3-backup/nightly-snap-2026.04.01/_restore" \
  -H 'Content-Type: application/json' -d '{
    "indices": "logs-2026.03.31",
    "rename_pattern": "(.+)",
    "rename_replacement": "restored-$1"
  }'

# Verify document count matches
curl -s "localhost:9200/restored-logs-2026.03.31/_count"
curl -s "localhost:9200/logs-2026.03.31/_count"

Performance Tuning

Here are the performance levers that have the biggest real-world impact, ranked by effort-to-benefit ratio:

1. Use bulk indexing, always. Single-document indexing is orders of magnitude slower. Aim for bulk requests of 5-15MB.

# Bad: indexing one document at a time
curl -X POST "localhost:9200/logs/_doc" -H 'Content-Type: application/json' -d '{"message": "log entry"}'

# Good: bulk indexing
curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @bulk-data.ndjson
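Generating the NDJSON body is where bulk indexing usually goes wrong: every document needs an action line, and the body must end with a newline. A minimal sketch in Python (index name and documents are illustrative):

```python
import json

def build_bulk_body(index: str, docs: list) -> str:
    """Build an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

body = build_bulk_body("logs", [{"message": "log entry"}, {"message": "another"}])
```

In practice you would also chunk the document list so each request body stays in the 5-15MB range.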

2. Tune refresh interval for write-heavy indices. The default 1-second refresh creates a new Lucene segment every second, which is expensive. For logging use cases where near-real-time search is not critical:

PUT logs-write/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
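For one-off bulk loads or reindexing you can go further: disable refresh entirely for the duration, then restore it when the load finishes:

```
PUT logs-write/_settings
{ "index": { "refresh_interval": "-1" } }

# ... run the bulk load ...

PUT logs-write/_settings
{ "index": { "refresh_interval": "30s" } }
```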

3. Use doc_values and avoid fielddata. For aggregations and sorting on text fields, use the keyword type or multi-field mapping. Never enable fielddata on analyzed text fields - it loads the entire inverted index into heap memory.
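In mapping terms, that usually means a text field with a keyword sub-field: full-text search hits the analyzed field, while sorting and aggregations hit the sub-field's doc_values. Field names here are illustrative:

```
PUT logs-*/_mapping
{
  "properties": {
    "message": {
      "type": "text",
      "fields": {
        "raw": { "type": "keyword", "ignore_above": 256 }
      }
    }
  }
}
```

Aggregate and sort on message.raw; search full-text on message.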

4. Profile slow queries. Enable slow log to catch queries that degrade cluster performance:

PUT logs-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

5. Disable swapping entirely. Swapped JVM heap is a death sentence for Elasticsearch performance. Configure this at the OS level:

# /etc/sysctl.conf
vm.swappiness = 1

# elasticsearch.yml
bootstrap.memory_lock: true

Monitoring and Alerting

At minimum, monitor these metrics (Prometheus with elasticsearch_exporter, Datadog, or Elastic's own monitoring):

  • Cluster health - yellow means missing replicas, red means missing primaries
  • JVM heap pressure - alert at 75%, panic at 85%
  • Disk watermark - allocation of new shards stops at 85% (low), shards relocate away at 90% (high), and indices get a read-only block at 95% (flood stage)
  • Thread pool rejections - search and write thread pool rejections indicate saturation
  • GC pause time - young GC pauses over 100ms or old GC pauses over 1s are problems

# Quick health check script
curl -s localhost:9200/_cluster/health | jq '{
  status,
  number_of_nodes,
  active_primary_shards,
  relocating_shards,
  unassigned_shards
}'

# Thread pool rejections
curl -s "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,rejected,completed"
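If you wire these thresholds into your own alerting, the decision logic is small enough to sketch directly (the health dict mirrors the _cluster/health response; the heap percentage would come from _nodes/stats - treat the wiring as an assumption):

```python
def alert_level(health: dict, heap_percent: int) -> str:
    """Map cluster health status and JVM heap pressure to a severity."""
    if health.get("status") == "red" or heap_percent >= 85:
        return "critical"
    if health.get("status") == "yellow" or heap_percent >= 75:
        return "warning"
    if health.get("unassigned_shards", 0) > 0:
        return "warning"
    return "ok"
```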

Need Help with Your DevOps?

Running Elasticsearch in production requires ongoing attention to cluster health, capacity planning, and performance optimization. At InstaDevOps, we help startups and SMBs manage complex infrastructure like Elasticsearch clusters without the cost of a full-time hire - starting at $2,999/mo.

Book a free 15-minute consultation to discuss your Elasticsearch challenges.
