High Throughput Deployment Guide

Configure TraceCraft for high-volume production environments.

Overview

This guide covers optimizations for applications processing thousands of LLM requests per minute.

Key Optimizations

1. Use Buffering Exporter

Buffer traces before sending to reduce network overhead:

from tracecraft.exporters.otlp import OTLPExporter
from tracecraft.exporters.retry import BufferingExporter, RetryingExporter
from tracecraft.exporters.rate_limited import RateLimitedExporter
 
# Base OTLP exporter
otlp = OTLPExporter(
    endpoint="otel-collector:4317",
    service_name="my-app",
    timeout_ms=5000  # Short timeout for high throughput
)
 
# Add buffering (batch 100 traces before sending)
buffered = BufferingExporter(otlp, buffer_size=100)
 
# Add retry for resilience
retrying = RetryingExporter(
    buffered,
    max_retries=2,
    base_delay_ms=50,
    max_delay_ms=1000
)
 
# Add rate limiting to prevent overwhelming collector
rate_limited = RateLimitedExporter(
    retrying,
    rate=1000.0,  # 1000 exports/second max
    burst=100,
    blocking=False  # Drop instead of block
)
 
tracecraft.init(exporters=[rate_limited])

2. Configure Sampling

For very high traffic, sample traces:

from tracecraft.core.config import TraceCraftConfig, SamplingConfig
 
config = TraceCraftConfig(
    sampling=SamplingConfig(
        rate=0.1,  # Sample 10% of traces
        always_sample_errors=True  # But always sample errors
    )
)
 
tracecraft.init(config=config)

3. Async Context Propagation

Use async helpers for concurrent operations:

from tracecraft.contrib import gather_with_context, create_task_with_context
 
async def process_batch(queries: list[str]) -> list[str]:
    # All tasks maintain trace context
    tasks = [
        create_task_with_context(process_single(q))
        for q in queries
    ]
    return await gather_with_context(*tasks)

4. Disable Console Output

Console output adds latency:

tracecraft.init(
    console=False,  # Disable console
    jsonl=False,    # Disable local file
    exporters=[otlp_exporter]  # Only OTLP
)

5. Minimize Captured Data

Reduce payload size:

from tracecraft.core.config import TraceCraftConfig
 
config = TraceCraftConfig(
    # Truncate large inputs/outputs
    max_input_length=1000,
    max_output_length=1000,
    # Don't capture full prompts in production
    capture_prompts=False
)

Architecture for Scale

┌─────────────────────────────────────────────────────────────────┐
│                    Application Tier                              │
│                                                                  │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
│  │  Pod 1  │  │  Pod 2  │  │  Pod 3  │  │  Pod N  │            │
│  │(sampled)│  │(sampled)│  │(sampled)│  │(sampled)│            │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘            │
│       └────────────┼────────────┼────────────┘                  │
│                    ▼                                             │
│            ┌──────────────┐                                      │
│            │   Headless   │                                      │
│            │   Service    │                                      │
│            └──────┬───────┘                                      │
└───────────────────┼─────────────────────────────────────────────┘
                    ▼
┌───────────────────────────────────────────────────────────────────┐
│                   Collector Tier (HPA Enabled)                    │
│                                                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐               │
│  │  Collector  │  │  Collector  │  │  Collector  │               │
│  │  (batching) │  │  (batching) │  │  (batching) │               │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘               │
│         └────────────────┼────────────────┘                       │
│                          ▼                                        │
│                  ┌──────────────┐                                 │
│                  │    Kafka     │                                 │
│                  │   (buffer)   │                                 │
│                  └──────┬───────┘                                 │
│                         ▼                                         │
│                 ┌──────────────┐                                  │
│                 │   Backend    │                                  │
│                 │(Elasticsearch│                                  │
│                 │   /Jaeger)   │                                  │
│                 └──────────────┘                                  │
└───────────────────────────────────────────────────────────────────┘

OTEL Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
 
processors:
  batch:
    timeout: 1s
    send_batch_size: 10000
    send_batch_max_size: 20000
 
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 500
 
  probabilistic_sampler:
    sampling_percentage: 10
 
exporters:
  kafka:
    brokers:
      - kafka:9092
    topic: traces
    encoding: otlp_proto
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, probabilistic_sampler]
      exporters: [kafka]

Benchmarks

Tested on 8-core, 32GB RAM, Kubernetes cluster:

Configuration	Throughput	Latency (p99)	Drop Rate
Baseline (no tracing)	10,000 req/s	50ms	0%
TraceCraft (console)	8,500 req/s	65ms	0%
TraceCraft (OTLP)	9,200 req/s	55ms	0%
TraceCraft (buffered)	9,800 req/s	52ms	0%
TraceCraft (sampled 10%)	9,900 req/s	51ms	0%

Monitoring the Tracing System

Metrics to Watch

# Access exporter stats
from tracecraft.exporters.rate_limited import RateLimitedExporter
 
rate_limited = RateLimitedExporter(...)
 
# Check dropped count
print(f"Dropped traces: {rate_limited.dropped_count}")

Prometheus Metrics

Configure OTEL Collector to expose metrics:

extensions:
  zpages:
    endpoint: 0.0.0.0:55679
 
service:
  extensions: [zpages]

Troubleshooting

High Memory Usage

Reduce buffer size
Enable aggressive sampling
Check for memory leaks in custom processors

Dropped Traces

Increase rate limit
Scale collector horizontally
Use Kafka as buffer

High Latency

Disable console output
Use async exports
Reduce captured data size

Kubernetes Overview