Troubleshooting
This guide covers the most common problems encountered when installing, configuring, or running TraceCraft. Each issue includes the root cause and a concrete solution.
If your problem is not listed here, check the FAQ or open an issue on GitHub.
Installation Issues
ModuleNotFoundError for an adapter or exporter
Problem
ModuleNotFoundError: No module named 'tracecraft.adapters.langchain'
Cause
TraceCraft uses optional dependency groups (extras) to keep the base package lightweight. Adapters and some exporters are not installed unless you request them explicitly.
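Before installing anything, it can help to confirm which optional modules the current interpreter can actually see. The following is a minimal sketch using only the standard library; the module names come from the error message above, and the helper name module_available is hypothetical rather than part of TraceCraft.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located by the import system.

    find_spec imports parent packages of a dotted name, but not the
    target module itself.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is itself missing.
        return False


# Module names taken from the error message discussed in this section.
for name in ("tracecraft", "tracecraft.adapters.langchain"):
    state = "available" if module_available(name) else "missing"
    print(f"{name}: {state}")
```

Running this inside the same virtual environment as the application shows whether the problem is a missing extra or a different interpreter being used.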
Solution
Install the extra that provides the module you need:
pip install "tracecraft[langchain]" # LangChain adapter
pip install "tracecraft[llamaindex]" # LlamaIndex adapter
pip install "tracecraft[pydantic-ai]" # PydanticAI adapter
pip install "tracecraft[claude-sdk]" # Claude SDK adapter
pip install "tracecraft[otlp]" # OTLP exporter
pip install "tracecraft[mlflow]" # MLflow exporter
pip install "tracecraft[receiver]" # OpenTelemetry receiver
pip install "tracecraft[auto]" # Auto-instrumentation
pip install "tracecraft[all]" # All extras
Prevention
Add the required extra to your pyproject.toml or requirements.txt so it is always
installed in every environment:
[project]
dependencies = [
    "tracecraft[langchain,otlp]",
]
OpenTelemetry version conflict
Problem
ImportError: cannot import name 'SpanExporter' from 'opentelemetry.sdk.trace.export'
AttributeError: module 'opentelemetry.trace' has no attribute 'use_span'
Cause
TraceCraft targets the OpenTelemetry Python SDK >=1.20. An older version installed by
another package in your environment is taking precedence.
Solution
Pin or upgrade to a compatible version:
pip install "opentelemetry-sdk>=1.20" "opentelemetry-api>=1.20"
pip install --upgrade "tracecraft[otlp]"
Check for conflicting packages:
pip show opentelemetry-sdk opentelemetry-api
pip list | grep opentelemetry
If a framework dependency forces an older version, use a virtual environment or a dependency resolver override:
[tool.uv]
override-dependencies = ["opentelemetry-sdk>=1.20"]
Prevention
Use a lockfile (uv.lock, requirements.txt from pip-compile) and run dependency
audits (pip check) in CI to catch version conflicts early.
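Alongside pip check, a CI step can assert the minimum versions directly. This is a rough sketch using importlib.metadata; the helper names are hypothetical, and the version parsing is deliberately simple (it ignores pre-release suffixes), so a dedicated version library would be more robust.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string: str) -> tuple[int, ...]:
    """Turn '1.20.0' into (1, 20, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(package: str, minimum: tuple[int, ...]) -> bool:
    """Return True if `package` is installed at or above `minimum`."""
    try:
        installed = parse_release(version(package))
    except PackageNotFoundError:
        return False
    return installed >= minimum


# The 1.20 floor comes from the Cause paragraph in this section.
for package in ("opentelemetry-sdk", "opentelemetry-api"):
    status = "ok" if meets_minimum(package, (1, 20)) else "missing or too old"
    print(f"{package}: {status}")
```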
Platform-specific build failures on Apple Silicon or Windows
Problem
error: Failed building wheel for grpcio
error: Microsoft Visual C++ 14.0 or greater is required
Cause
The grpcio package (required by tracecraft[otlp]) has native extensions that must be
compiled if a pre-built wheel is not available for your platform and Python version.
Solution
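On Windows, install the Microsoft C++ Build Tools if no pre-built wheel is available for your Python version. On macOS, install the Xcode command-line tools and upgrade pip so that it can fetch pre-built wheels: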
xcode-select --install
pip install --upgrade pip
pip install "tracecraft[otlp]"
If wheels are still unavailable, install grpcio with system OpenSSL:
GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 pip install grpcio
Prevention
Pin grpcio to a version for which pre-built wheels exist for your target platforms.
Check PyPI for available wheels before upgrading.
Traces Not Appearing
No output to the console
Problem
Your decorated functions run without errors, but nothing appears in the terminal.
Cause
The console exporter is not enabled, or tracecraft.init() was not called before the
decorated functions ran.
Solution
Ensure tracecraft.init() is called at application startup, before any decorated
function is invoked:
import tracecraft
# Must be called before any @trace_* decorated function executes
tracecraft.init(console=True)
@tracecraft.trace_agent(name="my_agent")
def my_agent(prompt: str) -> str:
    ...
my_agent("Hello") # Trace now appears in the console
Inspect the active configuration to confirm the console exporter is active:
runtime = tracecraft.get_runtime()
print(runtime.config)
Prevention
Call tracecraft.init() as early as possible - typically the first statement after
standard library imports in your entry point module.
OTLP backend is not receiving traces
Problem
Your application runs and tracecraft.init() succeeds, but the OTLP backend (Jaeger,
Grafana Tempo, etc.) shows no traces.
Cause
Common causes include a wrong endpoint URL, missing TLS configuration, a firewall blocking the collector port, or the backend process not yet running.
Solution
- Verify the endpoint is reachable from the application host:
  # HTTP/Protobuf endpoint
  curl -v http://localhost:4318/v1/traces
  # gRPC endpoint - check the port is open
  nc -zv localhost 4317
- Enable debug logging to see what TraceCraft is attempting to send:
  import logging
  logging.basicConfig() # Attach a handler so DEBUG records are printed
  logging.getLogger("tracecraft").setLevel(logging.DEBUG)
- Review the exporter configuration and confirm insecure=True for local testing:
  from tracecraft.exporters import OTLPExporter
  tracecraft.init(
      exporters=[
          OTLPExporter(
              endpoint="http://localhost:4317",
              insecure=True,
          )
      ]
  )
- Add a JSONL fallback exporter so traces are not lost while debugging:
  from tracecraft.exporters import OTLPExporter, JSONLExporter
  tracecraft.init(
      exporters=[
          OTLPExporter(endpoint="http://localhost:4317"),
          JSONLExporter(filepath="debug-traces.jsonl"),
      ]
  )
Prevention
Add a health check in your startup code that verifies the OTLP endpoint is reachable before the application begins serving requests.
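One way to implement such a startup check is a plain TCP connection attempt. The sketch below uses only the standard library; endpoint_reachable is a hypothetical helper name, and a successful connection only shows that the port is open, not that the collector accepts OTLP data.

```python
import socket


def endpoint_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS failures.
        return False
```

Calling endpoint_reachable("localhost", 4317) before tracecraft.init() and logging a warning on False makes a misconfigured endpoint visible at startup rather than after traces have gone missing.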
TUI shows no traces after running the OTLP receiver
Problem
You configured the OTel receiver (tracecraft.otel.setup_exporter) and ran your
application, but tracecraft tui shows an empty trace list.
Cause
The receiver writes traces to a JSONL file or SQLite database. The TUI must be pointed at the same file path that the receiver is writing to.
Solution
Check the receiver configuration for the output path and pass that same path to the TUI:
from tracecraft.otel import setup_exporter
tracer = setup_exporter(
    endpoint="http://localhost:4318",
    service_name="my-agent",
    output_path="traces/receiver.jsonl", # Note this path
)
tracecraft tui traces/receiver.jsonl # Use the same path
If no output_path was specified, check the default path documented in setup_exporter:
from tracecraft.otel import setup_exporter
help(setup_exporter)
Prevention
Always specify an explicit output_path in setup_exporter so the path is unambiguous
in both your application code and your TUI launch command.
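To confirm that the receiver is actually writing to the path you expect, you can count the records in the file before launching the TUI. This is a small standard-library sketch; it assumes only that the file holds one JSON document per line and makes no assumption about TraceCraft's record schema. The helper name is hypothetical.

```python
import json
from pathlib import Path


def count_jsonl_records(path: str) -> int:
    """Count the lines in `path` that parse as JSON; 0 if the file is missing."""
    file = Path(path)
    if not file.is_file():
        return 0
    count = 0
    with file.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                continue  # Skip partial or corrupt lines
            count += 1
    return count


# Path taken from the setup_exporter example in this section.
print(count_jsonl_records("traces/receiver.jsonl"))
```

A result of 0 suggests the receiver is writing somewhere else or has not received any spans yet.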
Trace hierarchy is flat - child steps are missing
Problem
The TUI shows the top-level agent step, but LLM calls, tool invocations, and retrieval steps that happen inside it are missing or appear as separate root traces.
Cause
The contextvars.ContextVar that carries the current AgentRun and parent Step is
not propagating into the child execution context. This happens when:
- Child functions run in separate threads (ThreadPoolExecutor).
- Child functions run in separate processes (ProcessPoolExecutor).
- Framework callbacks are invoked in a different context (use the framework-specific adapter instead of plain decorators).
- Steps are started in asyncio tasks that were created before the parent step opened.
Solution
For threads, copy the context explicitly before submitting work:
import contextvars
from concurrent.futures import ThreadPoolExecutor
ctx = contextvars.copy_context()
with ThreadPoolExecutor() as pool:
    future = pool.submit(ctx.run, my_traced_function, arg)
For async tasks, create them inside a traced function so they inherit its context:
import asyncio
from tracecraft import trace_agent, trace_tool
@trace_agent(name="coordinator")
async def coordinator(tasks: list[str]) -> list[str]:
    # Tasks created here inherit the coordinator's context
    return await asyncio.gather(*[worker(t) for t in tasks])
@trace_tool(name="worker")
async def worker(task: str) -> str:
    ...
For framework integrations, always use the TraceCraft adapter:
from tracecraft.adapters.langchain import TraceCraftCallbackHandler
handler = TraceCraftCallbackHandler()
result = chain.invoke(input, config={"callbacks": [handler]})
Prevention
Use the framework-specific adapters for LangChain, LlamaIndex, and PydanticAI. For
custom threading code, always propagate context with contextvars.copy_context().run().
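The thread case can be reproduced without TraceCraft at all. The sketch below uses a stand-in ContextVar (current_step is a hypothetical name, not TraceCraft's internal variable) to show how a worker thread sees the parent only when the context is copied explicitly.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the variable that would carry the current parent step.
current_step = contextvars.ContextVar("current_step", default=None)


def read_parent():
    """Return whatever parent the calling context can see."""
    return current_step.get()


def run_in_pool(propagate: bool):
    """Run read_parent() in a worker thread, optionally with a copied context."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        if propagate:
            ctx = contextvars.copy_context()
            return pool.submit(ctx.run, read_parent).result()
        return pool.submit(read_parent).result()


current_step.set("coordinator")
print("without copy_context:", run_in_pool(propagate=False))
print("with copy_context:   ", run_in_pool(propagate=True))
```

On a default CPython build the worker thread starts with an empty context, so the first call typically reports the default value while the second reports "coordinator". That difference is what produces a flat hierarchy.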
Integration Problems
LangChain callbacks are not firing
Problem
You attached TraceCraftCallbackHandler to a chain, but the handler methods are never
called and no steps appear in the trace.
Cause
The handler must be passed as part of the config dictionary at invocation time. Adding
it only to a component at construction time does not guarantee it propagates to all steps
in the chain.
Solution
Pass the handler inside config at .invoke() time:
from tracecraft.adapters.langchain import TraceCraftCallbackHandler
handler = TraceCraftCallbackHandler()
# Correct: config is forwarded to every step in the chain
result = chain.invoke(
    {"input": "Hello"},
    config={"callbacks": [handler]},
)
The same applies to .ainvoke(), .stream(), and .astream().
Prevention
Always pass config={"callbacks": [handler]} at the top-level invocation. Consult the
LangChain documentation to verify that the specific component you are using forwards the
config dict to its children.
Auto-instrumentation patches are not applied
Problem
After calling tracecraft.init(auto_instrument=True), OpenAI or Anthropic API calls do
not appear in traces.
Cause
The openai (or anthropic) module was already imported before tracecraft.init() was
called. The patch cannot be applied retroactively to a module that is already loaded.
Solution
Restructure your entry point so patching happens before any other imports:
# 1. Initialize the runtime and patch SDKs in one step
import tracecraft
tracecraft.init(console=True, auto_instrument=True)
# 2. Now import the rest of the application
import openai
from my_app import run_agent
run_agent()
If openai is imported at module level in many files, create a dedicated
instrumentation.py that is imported first in main.py.
Prevention
Treat tracecraft.init(auto_instrument=True) like logging setup - it must be the
very first line of your application entry point, before any application code is
imported.
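A cheap guard against import-order regressions is to check sys.modules just before initialization. This is a sketch using only the standard library; the helper name imported_too_early is hypothetical, and the default module names are the two SDKs mentioned in this section.

```python
import sys


def imported_too_early(
    sdk_modules: tuple[str, ...] = ("openai", "anthropic"),
) -> list[str]:
    """Return the SDK modules that are already present in sys.modules."""
    return [name for name in sdk_modules if name in sys.modules]


early = imported_too_early()
if early:
    print(f"Warning: imported before tracing was initialized: {', '.join(early)}")
```

Placing this check immediately before tracecraft.init(auto_instrument=True) turns a silent missing-span problem into a visible warning.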
Duplicate spans appearing in traces
Problem
Each LLM call appears twice in the trace - once from a @trace_llm decorator and once
from the auto-instrumentation or an adapter.
Cause
Both a @trace_llm decorator on a wrapper function and auto-instrumentation are active,
so the same underlying API call is captured by two separate mechanisms.
Solution
Choose one instrumentation strategy per code path:
# Option A: Use @trace_llm decorator, disable auto_instrument for that SDK
@trace_llm(name="gpt4_call", model="gpt-4o", provider="openai")
async def call_gpt4(prompt: str) -> str:
    response = await client.chat.completions.create(...)
    return response.choices[0].message.content
# Option B: Use auto-instrumentation, remove the decorator
async def call_gpt4(prompt: str) -> str:
    response = await client.chat.completions.create(...) # Captured automatically
    return response.choices[0].message.content
To enable auto-instrumentation for one SDK only:
import tracecraft
tracecraft.init(auto_instrument=["anthropic"]) # Only Anthropic, not OpenAI
Prevention
Document which instrumentation strategy is used for each SDK in your codebase. Avoid mixing decorators and auto-instrumentation for the same API.
Steps appear with the wrong StepType
Problem
Retrieval operations appear as TOOL steps, or agent orchestration appears as LLM
steps in the TUI.
Cause
The wrong decorator was applied, or a framework adapter mapped a callback to a different type than expected.
Solution
Use the decorator that matches the semantic role of the function:
from tracecraft import trace_agent, trace_tool, trace_llm, trace_retrieval
# Orchestration / decision-making
@trace_agent(name="planner")
async def planner(goal: str) -> list[str]: ...
# External tool calls
@trace_tool(name="web_search")
async def web_search(query: str) -> list[str]: ...
# LLM API calls
@trace_llm(name="summarizer", model="gpt-4o", provider="openai")
async def summarize(text: str) -> str: ...
# Vector search / document retrieval
@trace_retrieval(name="vector_search")
async def vector_search(query: str) -> list[str]: ...
To set a type that does not have a dedicated decorator, use the step() context manager:
from tracecraft import step
from tracecraft.core.models import StepType
with step("quality_check", type=StepType.EVALUATION) as s:
    result = run_evaluator(output, expected)
    s.outputs["score"] = result.score
Prevention
Review Core Concepts for the definition of each
StepType and the decorator that corresponds to it.
Performance Issues
High memory usage
Problem
Application memory grows steadily over time when processing many agent tasks.
Cause
The export queue is filling up faster than it is being flushed. This occurs when the OTLP backend is slow or unreachable, causing in-memory batches to accumulate.
Solution
Reduce the queue size limit and flush more frequently:
import os
import tracecraft
from tracecraft.exporters import AsyncBatchExporter, OTLPExporter
exporter = AsyncBatchExporter(
    exporter=OTLPExporter(endpoint=os.getenv("OTLP_ENDPOINT")),
    batch_size=50, # Flush more often
    max_queue_size=200, # Drop oldest traces when the queue is full
    flush_interval_ms=2000, # Flush every 2 seconds
)
tracecraft.init(
    sampling_rate=0.05,
    exporters=[exporter],
)
Monitor queue health:
runtime = tracecraft.get_runtime()
print("Export errors:", runtime.export_errors)
print("Traces dropped:", runtime.traces_dropped)
Prevention
Always set max_queue_size explicitly so the queue cannot grow unbounded. Match
sampling_rate to your backend’s ingestion capacity.
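The effect of a bounded queue that drops its oldest entries can be pictured with a few lines of standard-library code. This is an illustration of the general technique only, not TraceCraft's implementation; BoundedQueue and its methods are hypothetical names.

```python
from collections import deque


class BoundedQueue:
    """FIFO queue that discards the oldest item once `max_size` is reached."""

    def __init__(self, max_size: int) -> None:
        self._items = deque(maxlen=max_size)
        self.dropped = 0

    def put(self, item) -> None:
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # deque evicts the oldest item on append
        self._items.append(item)

    def drain(self, batch_size: int) -> list:
        """Remove and return up to `batch_size` of the oldest items."""
        count = min(batch_size, len(self._items))
        return [self._items.popleft() for _ in range(count)]


pending = BoundedQueue(max_size=3)
for trace_id in range(5):
    pending.put(trace_id)
print(pending.drain(batch_size=10), pending.dropped)  # prints: [2, 3, 4] 2
```

Memory stays capped at max_size items no matter how slow the backend is, at the cost of losing the oldest traces. That trade-off is why the dropped counter is worth monitoring.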
Slow application startup
Problem
Application startup is noticeably slower after adding tracecraft.init().
Cause
Importing optional extras, particularly grpcio for the OTLP exporter, can add
100-300 ms to import time because of native extension loading.
Solution
Defer tracecraft.init() until after the application framework has finished its own
startup, or use only the built-in exporters (console, JSONL) which have no native
dependencies:
# Fast startup: no native extensions
tracecraft.init(
    jsonl=True,
    jsonl_path="/tmp/traces/",
    console=False,
)
Measure import overhead before and after adding extras:
python -X importtime -c "import tracecraft" 2>&1 | grep tracecraft
Prevention
Install only the extras you use. Avoid tracecraft[all] in production images where
startup time is critical.
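If you prefer to measure import cost from Python rather than with -X importtime, a coarse timer is enough to compare extras. The sketch below assumes nothing about TraceCraft beyond its import name; grpc is the import name of the grpcio package, and time_import is a hypothetical helper.

```python
import importlib
import time


def time_import(module_name: str) -> float:
    """Return the seconds spent importing `module_name`, or -1.0 if unavailable.

    A module that is already loaded returns almost immediately, so run this
    in a fresh interpreter for a meaningful number.
    """
    start = time.perf_counter()
    try:
        importlib.import_module(module_name)
    except ImportError:
        return -1.0
    return time.perf_counter() - start


for name in ("tracecraft", "grpc"):
    elapsed = time_import(name)
    label = "not installed" if elapsed < 0 else f"{elapsed * 1000:.1f} ms"
    print(f"{name}: {label}")
```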
High CPU usage from PII redaction
Problem
CPU usage spikes significantly when enable_pii_redaction=True is set alongside a high
sampling rate.
Cause
RedactionProcessor runs regex patterns against every string value in every trace
attribute. At 100% sampling with many patterns, this scales linearly with trace volume.
Solution
Switch to ProcessorOrder.EFFICIENCY so redaction only runs on traces that survive
sampling, and reduce the number of custom patterns:
from tracecraft.core.config import ProcessorOrder
tracecraft.init(
    enable_pii_redaction=True,
    processor_order=ProcessorOrder.EFFICIENCY, # Sample first, then redact
    sampling_rate=0.1,
)
Security trade-off
EFFICIENCY mode skips redaction on traces that are sampled out. Only use it when
sampled-out traces are discarded immediately and never written to any persistent
storage.
Limit which step attributes are scanned by specifying redact_keys:
from tracecraft.processors.redaction import RedactionProcessor
redaction = RedactionProcessor(
    redact_keys=["inputs", "outputs"], # Only scan these step attributes
)
tracecraft.init(processors=[redaction])
Prevention
Profile redaction overhead in a load test before enabling it in production. Start with the built-in patterns and add custom patterns one at a time.
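A micro-benchmark makes the per-pattern cost visible before a full load test. The patterns below are illustrative stand-ins, not TraceCraft's built-in rules, and redact is a hypothetical function that mimics the general approach of applying each regex to each string.

```python
import re
import timeit

# Illustrative patterns only; these are not TraceCraft's built-in rules.
PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email-like strings
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # US SSN-like numbers
]


def redact(text: str, patterns=PATTERNS) -> str:
    """Apply every pattern in turn, replacing matches with a placeholder."""
    for pattern in patterns:
        text = pattern.sub("[REDACTED]", text)
    return text


sample = "Contact alice@example.com about case 123-45-6789. " * 20
one = timeit.timeit(lambda: redact(sample, PATTERNS[:1]), number=200)
two = timeit.timeit(lambda: redact(sample, PATTERNS), number=200)
print(f"1 pattern: {one:.4f}s   2 patterns: {two:.4f}s")
```

Repeating the measurement each time a custom pattern is added shows which patterns dominate the cost.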
Export Failures
Connection refused when exporting to OTLP
Problem
grpc._channel._InactiveRpcError: StatusCode.UNAVAILABLE
Connection refused: localhost:4317
Cause
The OTLP collector or backend is not running, or the application is attempting to connect before the collector is ready.
Solution
- Confirm the collector is listening:
  nc -zv localhost 4317 # gRPC port
  nc -zv localhost 4318 # HTTP port
- Add retry logic with exponential backoff:
  from tracecraft.exporters import RetryingExporter, OTLPExporter
  exporter = RetryingExporter(
      exporter=OTLPExporter(endpoint="http://localhost:4317"),
      max_retries=5,
      backoff_factor=2.0, # Waits: 1s, 2s, 4s, 8s, 16s
  )
  tracecraft.init(exporters=[exporter])
- Add a JSONL fallback so no traces are lost during the outage:
  from tracecraft.exporters import RetryingExporter, OTLPExporter, JSONLExporter
  tracecraft.init(
      exporters=[
          RetryingExporter(
              exporter=OTLPExporter(endpoint="http://localhost:4317"),
              max_retries=3,
          ),
          JSONLExporter(filepath="/var/log/traces/fallback.jsonl"),
      ]
  )
Prevention
In Docker Compose or Kubernetes, use a health check or depends_on condition so the
application does not start until the collector reports healthy.
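Where an orchestrator-level health check is not available, the application can wait for the collector itself. The sketch below shows one way to compute an exponential backoff schedule and retry a readiness probe. It assumes a 1-second base delay, which reproduces the 1s, 2s, 4s, 8s, 16s sequence mentioned in the Solution; both helper names are hypothetical.

```python
import time
from typing import Callable


def backoff_delays(max_retries: int, base: float = 1.0, factor: float = 2.0) -> list[float]:
    """Delay before each retry: base * factor**attempt."""
    return [base * factor**attempt for attempt in range(max_retries)]


def wait_until_ready(
    probe: Callable[[], bool],
    max_retries: int = 5,
    base: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or the retries are exhausted."""
    if probe():
        return True
    for delay in backoff_delays(max_retries, base, factor):
        sleep(delay)
        if probe():
            return True
    return False


print(backoff_delays(5))  # prints: [1.0, 2.0, 4.0, 8.0, 16.0]
```

The probe argument can be any zero-argument callable, for example a function that attempts a TCP connection to the collector port. The sleep parameter is injectable so the loop can be tested without real delays.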
TLS handshake errors
Problem
ssl.SSLCertVerificationError: certificate verify failed
grpc._channel._InactiveRpcError: StatusCode.UNAVAILABLE: failed to connect to all addresses
Cause
The OTLP exporter is connecting to a TLS-enabled endpoint but the certificate is self-signed, expired, or the system CA bundle does not include the signing authority.
Solution
If the backend has a valid certificate from a public CA, update your CA bundle:
pip install --upgrade certifi
Prevention
Use certificates from a trusted CA in production. Automate certificate renewal with Let’s Encrypt or your cloud provider’s managed certificate service.
Traces lost on application crash
Problem
When the application is forcefully terminated (SIGKILL, OOM kill), the last batch of traces in the export queue is lost.
Cause
AsyncBatchExporter holds pending traces in memory and flushes on a timer or when the
batch is full. A hard kill signal does not allow the exporter time to flush.
Solution
A forced kill (SIGKILL, OOM kill) cannot be intercepted, but most process managers send SIGTERM first. Register a shutdown hook to flush all pending traces before the process exits:
import atexit
import os
import signal
import sys
import tracecraft
tracecraft.init(otlp_endpoint=os.getenv("OTLP_ENDPOINT"))
def _flush_traces() -> None:
    """Flush all pending traces before the process exits."""
    runtime = tracecraft.get_runtime()
    runtime.shutdown(timeout=5.0)
def _handle_sigterm(signum, frame) -> None:
    # Exit normally so the atexit hook runs and flushes pending traces
    sys.exit(0)
atexit.register(_flush_traces)
signal.signal(signal.SIGTERM, _handle_sigterm)
Add a JSONL fallback that is written synchronously per trace, so it survives a crash:
import os
import tracecraft
from tracecraft.exporters import OTLPExporter, JSONLExporter
tracecraft.init(
    exporters=[
        OTLPExporter(endpoint=os.getenv("OTLP_ENDPOINT")),
        JSONLExporter(filepath="/var/log/traces/wal.jsonl"), # Written immediately
    ]
)
The JSONL file can be replayed later with the TUI or a custom script.
Prevention
Handle SIGTERM gracefully in all production services. In Kubernetes, set
terminationGracePeriodSeconds long enough for the exporter to flush (10-30 seconds
is sufficient for most batch sizes).
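The reason a synchronously written JSONL file survives a hard kill can be seen in a small standard-library sketch. Each record is flushed and synced before the call returns, so nothing is left in memory to lose. This illustrates the general write-ahead idea and is not TraceCraft's JSONLExporter; the function names and the record fields used with them are hypothetical.

```python
import json
import os


def append_record(path: str, record: dict) -> None:
    """Append one JSON line and force it to disk before returning."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
        handle.flush()             # Push Python's buffer to the OS
        os.fsync(handle.fileno())  # Ask the OS to commit the data to disk


def read_records(path: str) -> list[dict]:
    """Read back every non-empty JSON line, e.g. for replay after a crash."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

The per-record fsync costs throughput, which is why this approach suits a fallback path rather than the primary exporter.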