Observability · 2026-01-28 · 17 min read

Cloud-Native Observability: OpenTelemetry and Beyond

Introduction

Your application just slowed down. Users are complaining. The CEO is asking what's wrong. You have hundreds of microservices, thousands of containers, and millions of log lines. Where do you even start?

This is the observability problem. Traditional monitoring—checking if servers are up—isn't enough in cloud-native environments. You need to understand why your system is behaving a certain way, not just that something is wrong.

Observability is about instrumenting your systems to answer any question about their behavior. In this comprehensive guide, we'll explore modern observability practices, focusing on OpenTelemetry as the industry standard for instrumentation.

The Three Pillars of Observability

1. Metrics

Numeric measurements over time:

CPU usage: 45%
Request rate: 1,250 req/sec
Error rate: 0.5%
P95 latency: 450ms
Active users: 12,450

Good for: Dashboards, alerts, trends
Bad for: Understanding why something happened

2. Logs

Discrete events:

{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-api",
  "message": "Payment processing failed",
  "error": "Connection timeout to payment gateway",
  "user_id": 12345,
  "amount": 99.99
}

Good for: Debugging, understanding what happened
Bad for: High-cardinality queries, correlation across services (for a detailed comparison of logging solutions, see Logging at Scale: ELK vs Loki vs CloudWatch)

3. Traces

Request journey through distributed system:

User Request → Frontend (50ms)
  ├─> API Gateway (5ms)
  │   ├─> Auth Service (20ms)
  │   ├─> Product Service (100ms)
  │   │   └─> Database Query (80ms) ← SLOW!
  │   └─> Inventory Service (30ms)
  └─> Response (Total: 205ms)

Good for: Understanding request flow, finding bottlenecks
Bad for: Aggregation, high-level trends

Why Traditional Monitoring Fails

The Kubernetes Problem

Traditional Monitoring (Pre-Kubernetes):
- Fixed servers with static IPs
- Server metrics tell you what's wrong
- SSH to server, check logs
- Simple to debug

Kubernetes:
- Pods come and go every few minutes
- IP addresses change constantly
- Logs disappear when pod dies
- Which pod handled the failing request?
- Impossible to debug with traditional tools

The Microservices Problem

Monolith:
User Request → Application → Database
              (Easy to trace)

Microservices:
User Request → API Gateway
  ├─> Service A → Service B → Service C
  ├─> Service D → Service E
  └─> Service F → Service G → Service H → Service I

Question: "Why is this request slow?"
Traditional monitoring: Can't tell you
Observability: Shows exact bottleneck

OpenTelemetry: The Standard

What is OpenTelemetry?

OpenTelemetry (OTel) is a vendor-neutral, open-source standard for instrumenting applications to generate telemetry data (metrics, logs, traces).

Before OpenTelemetry:
- Proprietary agents for each vendor
- Vendor lock-in
- Different instrumentation for each tool

With OpenTelemetry:
- Single SDK for all telemetry
- Send to any backend
- Standardized across languages
- No vendor lock-in
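
Much of the SDK is configurable through standard environment variables, so switching backends or tuning sampling is often a deploy-time change rather than a code change. A few of the standard variables (values here are illustrative):

OTEL_SERVICE_NAME=payment-api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1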

Architecture

Application → OpenTelemetry SDK → OpenTelemetry Collector → Backend
                                        ↓
                                  (Process, filter, route)
                                        ↓
                            ┌───────────┴───────────┐
                            │                       │
                       Prometheus             Jaeger/Tempo
                       (Metrics)              (Traces)
                            │                       │
                       Grafana ←──────────────────┘
                     (Visualization)

Installing OpenTelemetry

Python:

# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp \
#   opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests

import requests
from flask import Flask, jsonify

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Set up tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to collector
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317",
    insecure=True
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrument Flask and requests
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)

@app.route('/api/users/<user_id>')
def get_user(user_id):
    # Automatically traced!
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)
        
        # Database query (`db`/`User` stand in for your ORM session/model;
        # add e.g. opentelemetry-instrumentation-sqlalchemy to auto-trace it)
        user = db.query(User).filter(User.id == user_id).first()
        
        # External API call (auto-instrumented)
        orders = requests.get(f"http://order-service/users/{user_id}/orders")
        
        return jsonify({
            "user": user,
            "orders": orders.json()
        })

Node.js:

// npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();

// Your application code - automatically instrumented!
const express = require('express');
const app = express();

app.get('/api/users/:userId', async (req, res) => {
  // Auto-traced! (`User` stands in for your model; global fetch needs Node 18+)
  const user = await User.findById(req.params.userId);
  const orders = await fetch(`http://order-service/users/${req.params.userId}/orders`);
  
  res.json({
    user,
    orders: await orders.json()
  });
});

Go:

// go get go.opentelemetry.io/otel \
//        go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc \
//        go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() {
    exporter, err := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatalf("failed to create OTLP exporter: %v", err)
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )

    otel.SetTracerProvider(tp)
}

func main() {
    initTracer()
    
    // Wrap HTTP handler for auto-tracing
    handler := http.HandlerFunc(getUserHandler)
    wrappedHandler := otelhttp.NewHandler(handler, "get-user")
    
    http.Handle("/api/users/", wrappedHandler)
    http.ListenAndServe(":8080", nil)
}

OpenTelemetry Collector

# otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  
  # Add resource attributes
  resource:
    attributes:
    - key: environment
      value: production
      action: upsert
  
  # Sample traces (keep 10%)
  probabilistic_sampler:
    sampling_percentage: 10

exporters:
  # Export to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
  
  # Export to Jaeger (recent collector releases dropped the dedicated
  # jaeger exporter; modern Jaeger ingests OTLP directly)
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  
  # Export to Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  
  # Export to Loki (logs)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, probabilistic_sampler]
      exporters: [otlp/jaeger, otlp/tempo]
    
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]

Distributed Tracing Deep Dive

Trace Context Propagation

How traces work across services:

1. Frontend receives request
   trace-id: abc123
   span-id: 001

2. Frontend calls Backend
   Headers: 
     traceparent: 00-abc123-001-01
   
3. Backend creates child span
   trace-id: abc123 (same!)
   span-id: 002 (new)
   parent-id: 001

4. Backend calls Database
   Headers:
     traceparent: 00-abc123-002-01

5. Database creates child span
   trace-id: abc123 (same!)
   span-id: 003 (new)
   parent-id: 002

Result: Full trace across all services!
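
Auto-instrumentation propagates these headers for you, but when a request crosses a boundary OpenTelemetry doesn't know about (a custom queue, a batch job), you can inject and extract the context manually with the globally configured propagator. A minimal Python sketch (call_downstream, handle_message, and process are hypothetical helpers):

import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def call_downstream(url):
    # Outgoing: copy the current trace context into the HTTP headers
    headers = {}
    inject(headers)  # adds the W3C traceparent header
    return requests.get(url, headers=headers)

def handle_message(headers, payload):
    # Incoming: continue the caller's trace instead of starting a new one
    ctx = extract(headers)
    with tracer.start_as_current_span("handle_message", context=ctx):
        process(payload)  # spans created here share the caller's trace-id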

Custom Spans

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.route('/api/checkout')
def checkout():
    # Parent span (auto-created by Flask instrumentation)
    
    with tracer.start_as_current_span("validate_cart") as span:
        span.set_attribute("cart.items", len(cart.items))
        validate_cart(cart)
    
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", cart.total)
        span.set_attribute("payment.method", "credit_card")
        
        try:
            charge_id = process_payment(cart.total)
            span.set_attribute("payment.charge_id", charge_id)
            span.set_status(trace.Status(trace.StatusCode.OK))
        except PaymentError as e:
            span.set_status(
                trace.Status(
                    trace.StatusCode.ERROR,
                    str(e)
                )
            )
            span.record_exception(e)
            raise
    
    with tracer.start_as_current_span("create_order"):
        order = create_order(cart, charge_id)
        
    return {"order_id": order.id}

Sampling Strategies

# Tail-based sampling: the collector buffers each complete trace,
# then decides whether to keep it (head-based sampling, by contrast,
# decides up front when the root span is created)

processors:
  # Always keep traces that contain errors
  tail_sampling:
    policies:
    - name: errors
      type: status_code
      status_code:
        status_codes: [ERROR]
    
    # Sample 10% of successful requests
    - name: success
      type: probabilistic
      probabilistic:
        sampling_percentage: 10
    
    # Always sample slow requests (>1s)
    - name: slow
      type: latency
      latency:
        threshold_ms: 1000
    
    # Always sample specific endpoints
    - name: critical-endpoints
      type: string_attribute
      string_attribute:
        key: http.route
        values:
        - /api/checkout
        - /api/payment
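
The tail_sampling processor above buffers complete traces in the collector before deciding. Head-based sampling is the cheaper alternative: the decision is made in the SDK when the root span starts, so discarded traces are never exported at all. A sketch for the Python SDK:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of traces, decided at the root span; ParentBased makes child
# spans follow the root's decision, so traces are never half-sampled
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))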

Metrics with OpenTelemetry

import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

# Set up metrics; the reader registers with the prometheus_client
# registry so Prometheus can scrape them
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter(__name__)

# Create metrics
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)

request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request duration"
)

active_users = meter.create_up_down_counter(
    "active_users",
    description="Currently active users"
)

# Use metrics
@app.route('/api/users')
def get_users():
    start = time.time()
    
    # Increment counter
    request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})
    
    # Business logic
    users = User.query.all()
    
    # Record duration
    duration = time.time() - start
    request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})
    
    return jsonify(users)

@app.route('/api/login', methods=['POST'])
def login():
    # User logged in
    active_users.add(1)
    return {"status": "success"}

@app.route('/api/logout', methods=['POST'])
def logout():
    # User logged out
    active_users.add(-1)
    return {"status": "success"}

Observability Stack

The LGTM Stack (Grafana)

# Loki (Logs), Grafana (Visualization), Tempo (Traces), Mimir (Metrics)

version: '3.8'

services:
  # Grafana (Visualization)
  grafana:
    image: grafana/grafana:latest
    ports:
    - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    volumes:
    - grafana-storage:/var/lib/grafana
  
  # Loki (Logs)
  loki:
    image: grafana/loki:latest
    ports:
    - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
  
  # Tempo (Traces)
  tempo:
    image: grafana/tempo:latest
    ports:
    - "3200:3200"  # Tempo HTTP API (queried by Grafana)
    # OTLP traffic reaches Tempo from the collector over the compose
    # network; publishing 4317 here would clash with the collector's port
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
    - ./tempo.yaml:/etc/tempo.yaml
  
  # Mimir (Metrics) or Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
    - "9090:9090"
    volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
  
  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
    - "4317:4317"  # OTLP gRPC
    - "4318:4318"  # OTLP HTTP
    - "8889:8889"  # Prometheus exporter
    volumes:
    - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    command: ["--config=/etc/otel-collector-config.yaml"]

volumes:
  grafana-storage:

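To wire the stack together, you can pre-provision Grafana's datasources instead of adding them in the UI. A sketch of a provisioning file (mount it into the grafana container at /etc/grafana/provisioning/datasources/; service names match the compose file above):

# datasources.yaml
apiVersion: 1

datasources:
- name: Prometheus
  type: prometheus
  url: http://prometheus:9090
- name: Tempo
  type: tempo
  url: http://tempo:3200
- name: Loki
  type: loki
  url: http://loki:3100
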
Grafana Dashboard Example

{
  "dashboard": {
    "title": "Application Observability",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Recent Errors",
        "type": "logs",
        "targets": [{
          "expr": "{level=\"error\"}"
        }]
      },
      {
        "title": "Trace Map",
        "type": "nodeGraph",
        "targets": [{
          "query": "traces"
        }]
      }
    ]
  }
}

Best Practices

1. Structured Logging

import structlog

logger = structlog.get_logger()

# Bad
logger.info(f"User {user_id} purchased {item_name} for ${amount}")

# Good
logger.info(
    "purchase_completed",
    user_id=user_id,
    item_name=item_name,
    amount=amount,
    payment_method=payment_method
)
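
You can take this further and stamp every log line with the active trace, which makes jumping from a log entry to its trace trivial. A sketch of a structlog processor that does this (assumes a tracer provider is already configured, as in the earlier examples):

import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    # Attach the current trace/span IDs when inside a valid span
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(
    processors=[
        add_trace_context,
        structlog.processors.JSONRenderer(),
    ]
)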

2. Correlation IDs

import uuid

import structlog
from flask import g, request
from opentelemetry import trace

@app.before_request
def before_request():
    # Generate or extract correlation ID
    correlation_id = request.headers.get('X-Correlation-ID', str(uuid.uuid4()))
    g.correlation_id = correlation_id
    
    # Add to logs
    structlog.contextvars.bind_contextvars(correlation_id=correlation_id)
    
    # Add to traces
    span = trace.get_current_span()
    span.set_attribute("correlation.id", correlation_id)

@app.after_request
def after_request(response):
    # Return correlation ID in response
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response

3. SLI/SLO Monitoring

# Service Level Indicators/Objectives

SLI (Service Level Indicator): What we measure
- Request success rate
- Request latency P95
- Availability

SLO (Service Level Objective): Target
- 99.9% success rate
- P95 latency < 500ms
- 99.95% availability

Alert: When SLO at risk
- Success rate < 99.9% for 5 minutes
- P95 latency > 500ms for 5 minutes
- Error budget consumed > 80%
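
These translate directly into Prometheus alerting rules against the metrics defined earlier. A sketch (assumes http_requests_total carries a status label, which the earlier metrics example would need to add):

groups:
- name: slo-alerts
  rules:
  - alert: SuccessRateBelowSLO
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.001
    for: 5m
    annotations:
      summary: "Error rate exceeds the 99.9% success-rate SLO"

  - alert: LatencyAboveSLO
    expr: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      ) > 0.5
    for: 5m
    annotations:
      summary: "P95 latency exceeds the 500ms SLO"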

4. Cost Management

Observability can be expensive:

1. Sample aggressively
   - Keep 100% of errors
   - Sample 10% of successful requests
   - Sample 1% of health checks

2. Use tiered storage
   - Hot: Last 7 days (expensive, fast queries)
   - Warm: 8-30 days (cheaper, slower queries)
   - Cold: 31-90 days (cheapest, slowest)
   - Archive: >90 days (S3, rarely accessed)

3. Set retention policies
   - Traces: 30 days
   - Metrics: 90 days (1m resolution), 1 year (1h resolution)
   - Logs: 7 days (debug), 90 days (error)
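
Retention itself is enforced in each backend. A sketch of where those knobs live (a Prometheus startup flag plus Loki config keys; Tempo has an analogous compactor block_retention setting):

# Prometheus: set at startup
#   --storage.tsdb.retention.time=90d

# Loki: retention via the compactor and limits config
limits_config:
  retention_period: 90d
compactor:
  retention_enabled: true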

Conclusion

Observability is essential for cloud-native applications. OpenTelemetry provides a vendor-neutral standard for instrumenting your applications, giving you the flexibility to choose backends while avoiding vendor lock-in.

Key takeaways:

  1. Implement all three pillars: Metrics, logs, and traces together provide complete observability
  2. Use OpenTelemetry: Industry standard, vendor-neutral, future-proof
  3. Start simple: Auto-instrumentation first, custom spans later
  4. Sample intelligently: Keep errors, sample successful requests
  5. Correlate everything: Use trace IDs across metrics, logs, and traces

The investment in observability pays for itself the first time you debug a production issue in minutes instead of hours.
