LlamaIndex Integration
TraceCraft integrates with LlamaIndex through the TraceCraftSpanHandler, a native LlamaIndex span
handler that captures every component invocation as a TraceCraft Step with full hierarchy, token
counts, and error details — without changing your existing LlamaIndex code.
Installation
pip install "tracecraft[llamaindex]"
This installs TraceCraft with LlamaIndex support (llama-index-core>=0.10).
Quick Start
import tracecraft
from tracecraft.adapters.llamaindex import TraceCraftSpanHandler
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.callbacks import CallbackManager
from datetime import UTC, datetime
# Initialize TraceCraft
tracecraft.init(console=True)
# Attach the span handler to LlamaIndex settings
handler = TraceCraftSpanHandler()
Settings.callback_manager = CallbackManager(handlers=[handler])
# Wrap every session in an AgentRun context
run = AgentRun(name="llamaindex_rag", start_time=datetime.now(UTC))
with run_context(run):
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is TraceCraft?")
print(response)
# Free internal tracking state when done
handler.clear()
run_context is required
TraceCraft creates steps only when an AgentRun is active. Always wrap LlamaIndex calls
inside run_context(run) or runtime.run("name").
How It Works
TraceCraftSpanHandler
TraceCraftSpanHandler extends LlamaIndex’s BaseSpanHandler. LlamaIndex calls new_span() when
a component starts and end_span() or drop_span() when it finishes. The handler:
- Inspects the component instance's class name and module to infer a StepType.
- Creates a Step with start_time, inputs, and optional model_name.
- Attaches the Step to the active AgentRun (or as a child of a parent span).
- Fills in end_time, duration_ms, outputs, and input_tokens/output_tokens on completion.
Type Inference
The handler infers step types from the component being called:
| Component pattern | StepType |
|---|---|
| Class/module name contains "llm" | LLM |
| Class/module name contains "retriever" | RETRIEVAL |
| Class/module name contains "tool" | TOOL |
| Class/module name contains "agent" | AGENT |
| Class name contains "query" or "engine" | WORKFLOW |
| Everything else | WORKFLOW |
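The table translates to roughly the following logic. The function below is an illustrative sketch of the inference rules, not TraceCraft's actual implementation:

```python
def infer_step_type(instance: object) -> str:
    """Infer a StepType name from a component's class and module names,
    mirroring the inference table above."""
    cls = type(instance)
    name = cls.__name__.lower()
    module = (cls.__module__ or "").lower()
    haystack = name + " " + module
    if "llm" in haystack:
        return "LLM"
    if "retriever" in haystack:
        return "RETRIEVAL"
    if "tool" in haystack:
        return "TOOL"
    if "agent" in haystack:
        return "AGENT"
    # "query"/"engine" classes and everything else fall through to WORKFLOW
    return "WORKFLOW"
```
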
Token Capture
For LLM steps, token counts are extracted from the raw response object:
response.raw["usage"]["prompt_tokens"] -> step.input_tokens
response.raw["usage"]["completion_tokens"] -> step.output_tokens
Streaming
When LlamaIndex streams tokens, you can call handler.on_llm_stream(id_, chunk) to accumulate
chunks. The step’s is_streaming flag is set to True and streaming_chunks holds the
accumulated text.
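Chunk accumulation can be pictured with a toy model like the one below; the real handler does this bookkeeping internally when you call on_llm_stream():

```python
class StreamAccumulator:
    """Toy model of per-span streaming accumulation, illustrating the
    behavior described above (not the handler's real implementation)."""

    def __init__(self):
        self._chunks = {}  # span id -> list of text chunks

    def on_llm_stream(self, id_, chunk):
        # Buffer each chunk under its span id as it arrives.
        self._chunks.setdefault(id_, []).append(chunk)

    def finalize(self, id_):
        # Join the chunks into the full streamed text and drop the buffer.
        return "".join(self._chunks.pop(id_, []))
```
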
Basic Examples
Simple Completion
import tracecraft
from tracecraft.adapters.llamaindex import TraceCraftSpanHandler
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from llama_index.llms.openai import OpenAI
from datetime import UTC, datetime
tracecraft.init()
handler = TraceCraftSpanHandler()
Settings.callback_manager = CallbackManager(handlers=[handler])
llm = OpenAI(model="gpt-4o-mini")
run = AgentRun(name="completion", start_time=datetime.now(UTC))
with run_context(run):
response = llm.complete("Explain LLM observability in one sentence.")
print(response.text)
handler.clear()
Chat Completion
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
llm = OpenAI(model="gpt-4o-mini")
run = AgentRun(name="chat", start_time=datetime.now(UTC))
with run_context(run):
messages = [
ChatMessage(role="system", content="You are a helpful assistant."),
ChatMessage(role="user", content="What makes TraceCraft different?"),
]
response = llm.chat(messages)
print(response.message.content)
Query Engines
Basic Query Engine
import tracecraft
from tracecraft.adapters.llamaindex import TraceCraftSpanHandler
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.callbacks import CallbackManager
from datetime import UTC, datetime
tracecraft.init()
handler = TraceCraftSpanHandler()
Settings.callback_manager = CallbackManager(handlers=[handler])
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
run = AgentRun(name="query_engine", start_time=datetime.now(UTC))
with run_context(run):
response = query_engine.query("Summarize the key findings.")
print(response)
handler.clear()
Custom Query Engine Settings
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.response_synthesizers import get_response_synthesizer
# Retriever with custom top-k
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=5,
)
# Response synthesizer with a specific mode
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
query_engine = RetrieverQueryEngine(
retriever=retriever,
response_synthesizer=response_synthesizer,
)
run = AgentRun(name="custom_query_engine", start_time=datetime.now(UTC))
with run_context(run):
response = query_engine.query("What are the main topics?")
print(response)
TraceCraft captures the retriever step (type RETRIEVAL) and the synthesizer step (type LLM)
as children of the overall query workflow.
RAG Pipelines
Basic RAG
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.chunk_size = 512
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)
run = AgentRun(name="basic_rag", start_time=datetime.now(UTC))
with run_context(run):
response = query_engine.query(
"What are the installation steps for TraceCraft?"
)
print(response)
for node in response.source_nodes:
print(f"Source: {node.metadata.get('file_name')}, Score: {node.score:.3f}")
Multi-Document RAG
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
docs_technical = SimpleDirectoryReader("docs/technical").load_data()
docs_user = SimpleDirectoryReader("docs/user_guide").load_data()
index = VectorStoreIndex.from_documents(docs_technical + docs_user)
query_engine = index.as_query_engine(similarity_top_k=5)
run = AgentRun(name="multi_doc_rag", start_time=datetime.now(UTC))
with run_context(run):
questions = [
"How does the exporter pipeline work?",
"What authentication methods are supported?",
"How do I configure sampling rates?",
]
for question in questions:
response = query_engine.query(question)
print(f"Q: {question}\nA: {response}\n")
RAG with Reranking
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.response_synthesizers import get_response_synthesizer
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
retriever = VectorIndexRetriever(index=index, similarity_top_k=10)
reranker = SentenceTransformerRerank(
model="cross-encoder/ms-marco-MiniLM-L-2-v2",
top_n=3,
)
query_engine = RetrieverQueryEngine(
retriever=retriever,
node_postprocessors=[reranker],
response_synthesizer=get_response_synthesizer(response_mode="compact"),
)
run = AgentRun(name="rag_with_reranking", start_time=datetime.now(UTC))
with run_context(run):
response = query_engine.query("What is the recommended deployment strategy?")
print(response)
TraceCraft captures retrieval, reranking, and synthesis as separate steps in the trace tree.
Chat Engines
Basic Chat Engine
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
chat_engine = index.as_chat_engine(chat_mode="best", verbose=False)
run = AgentRun(name="chat_session", start_time=datetime.now(UTC))
with run_context(run):
response = chat_engine.chat("What is TraceCraft?")
print(response)
response = chat_engine.chat("How does it compare to LangSmith?")
print(response)
response = chat_engine.chat("What exporters does it support?")
print(response)
handler.clear()
Chat Engine with Memory
from llama_index.core.memory import ChatMemoryBuffer
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(
chat_mode="context",
memory=memory,
system_prompt=(
"You are a technical assistant for TraceCraft. "
"Answer questions based on the documentation."
),
)
run = AgentRun(name="chat_with_memory", start_time=datetime.now(UTC))
with run_context(run):
turns = [
"What is the purpose of TraceCraft?",
"What step types are available?",
"Can you give me an example using the TOOL step type?",
]
for turn in turns:
response = chat_engine.chat(turn)
print(f"User: {turn}")
print(f"Assistant: {response}\n")
Agents
ReAct Agent
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
def multiply(a: float, b: float) -> float:
"""Multiply two numbers and return the result."""
return a * b
def search_docs(query: str) -> str:
"""Search the documentation for information about a topic."""
return f"Documentation results for '{query}': TraceCraft is an observability SDK for LLMs."
multiply_tool = FunctionTool.from_defaults(fn=multiply)
search_tool = FunctionTool.from_defaults(fn=search_docs)
llm = OpenAI(model="gpt-4o-mini")
agent = ReActAgent.from_tools(
[multiply_tool, search_tool],
llm=llm,
verbose=False,
max_iterations=10,
)
run = AgentRun(name="react_agent", start_time=datetime.now(UTC))
with run_context(run):
response = agent.query("What is 47 times 89? Also, what is TraceCraft?")
print(response)
QueryEngineTool Agent
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
index_docs = VectorStoreIndex.from_documents(SimpleDirectoryReader("docs").load_data())
index_code = VectorStoreIndex.from_documents(SimpleDirectoryReader("src").load_data())
docs_tool = QueryEngineTool(
query_engine=index_docs.as_query_engine(),
metadata=ToolMetadata(
name="documentation",
description="Search the TraceCraft documentation.",
),
)
code_tool = QueryEngineTool(
query_engine=index_code.as_query_engine(),
metadata=ToolMetadata(
name="source_code",
description="Search the TraceCraft source code.",
),
)
agent = ReActAgent.from_tools([docs_tool, code_tool], verbose=False)
run = AgentRun(name="multi_tool_agent", start_time=datetime.now(UTC))
with run_context(run):
response = agent.query(
"How does TraceCraft's LangChain adapter capture token counts? "
"Show me the relevant source code."
)
print(response)
Multi-Tool Agent
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
def get_weather(city: str) -> str:
"""Get current weather for a city."""
return f"Weather in {city}: 72F, partly cloudy."
def add_numbers(a: float, b: float) -> float:
"""Add two numbers together."""
return a + b
def lookup_company(name: str) -> str:
"""Return basic info about a company."""
db = {"tracecraft": "Open-source LLM observability SDK, founded 2024."}
return db.get(name.lower(), "Company not found.")
tools = [
FunctionTool.from_defaults(fn=get_weather),
FunctionTool.from_defaults(fn=add_numbers),
FunctionTool.from_defaults(fn=lookup_company),
]
agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4o-mini"))
run = AgentRun(name="multi_tool_agent", start_time=datetime.now(UTC))
with run_context(run):
response = agent.query(
"What is the weather in Seattle? "
"Also, what is 123 plus 456? "
"Tell me about TraceCraft."
)
print(response)
Streaming
Streaming Completions
import tracecraft
from tracecraft.adapters.llamaindex import TraceCraftSpanHandler
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from llama_index.llms.openai import OpenAI
from datetime import UTC, datetime
tracecraft.init()
handler = TraceCraftSpanHandler()
Settings.callback_manager = CallbackManager(handlers=[handler])
llm = OpenAI(model="gpt-4o-mini")
run = AgentRun(name="streaming_completion", start_time=datetime.now(UTC))
with run_context(run):
streaming_response = llm.stream_complete("Explain RAG in 3 sentences.")
for token in streaming_response:
print(token.delta, end="", flush=True)
print()
handler.clear()
Streaming RAG
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True)
run = AgentRun(name="streaming_rag", start_time=datetime.now(UTC))
with run_context(run):
streaming_response = query_engine.query("Describe the TraceCraft architecture.")
streaming_response.print_response_stream()
handler.clear()
Streaming and token counts
When streaming is enabled, LlamaIndex may not provide final token counts until the stream completes. TraceCraft captures whatever usage data is available in the response object.
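Reading the usage payload defensively, so a missing or partial usage block yields None rather than a KeyError, can be sketched as follows (assuming raw is dict-like):

```python
def extract_usage(raw):
    """Return (input_tokens, output_tokens) from a raw response payload;
    either value is None when the provider reports no usage."""
    usage = (raw or {}).get("usage") or {}
    return usage.get("prompt_tokens"), usage.get("completion_tokens")
```
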
Advanced Usage
Per-Run Configuration
Use run_context to assign metadata and group related queries under one trace:
import tracecraft
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
run = AgentRun(
name="nightly_report_generation",
start_time=datetime.now(UTC),
)
run.attributes["user_id"] = "user-789"
run.attributes["report_type"] = "weekly_summary"
with run_context(run):
for section in ["intro", "methodology", "results", "conclusion"]:
response = query_engine.query(f"Generate the {section} section of the report.")
print(response)
tracecraft.get_runtime().end_run(run)
Multiple Indices
TraceCraft handles calls across multiple indices in the same run:
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
engine_a = index_a.as_query_engine()
engine_b = index_b.as_query_engine()
run = AgentRun(name="multi_index_query", start_time=datetime.now(UTC))
with run_context(run):
answer_a = engine_a.query("What does document set A say about caching?")
answer_b = engine_b.query("What does document set B say about caching?")
print(f"Set A: {answer_a}")
print(f"Set B: {answer_b}")
Both queries appear as sibling steps inside the same run.
Custom Components
When you subclass a LlamaIndex component, TraceCraft infers the step type from the class and module
name. Ensure your class name contains a descriptive keyword (llm, retriever, tool, agent)
so the inference logic maps it correctly:
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import QueryBundle, NodeWithScore
class MyCustomRetriever(BaseRetriever):
"""Custom retriever — 'retriever' in the class name triggers RETRIEVAL step type."""
def _retrieve(self, query_bundle: QueryBundle) -> list[NodeWithScore]:
# Your custom retrieval logic
return []
Best Practices
1. Initialize Once at Startup
# app.py
import tracecraft
from tracecraft.adapters.llamaindex import TraceCraftSpanHandler
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
tracecraft.init(service_name="my-llamaindex-app")
handler = TraceCraftSpanHandler()
Settings.callback_manager = CallbackManager(handlers=[handler])
Attach the handler to Settings once. All subsequent LlamaIndex calls pick it up automatically.
2. Always Use run_context
Every query should be wrapped in a run_context (or runtime.run()) to provide the AgentRun
that steps are attached to. Without it, spans are silently dropped:
runtime = tracecraft.get_runtime()
with runtime.run("user_query"):
response = query_engine.query(user_question)
3. Call clear() After Each Session
The handler keeps in-flight spans in memory. Call handler.clear() after a session completes to
prevent unbounded memory growth, especially in long-running servers:
try:
with run_context(run):
response = query_engine.query(question)
finally:
handler.clear()
4. Group Batch Queries Into One Run
Wrapping a batch of queries in a single run gives you aggregate token and error counts across the whole batch:
with runtime.run("batch_processing") as run:
for query in query_list:
engine.query(query)
# run.total_tokens reflects all queries combined
5. Handle Errors Gracefully
When a LlamaIndex component raises an exception, drop_span() is called and the error is captured
in step.error automatically. Catch exceptions at the application level without losing trace data:
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
import logging
logger = logging.getLogger(__name__)
run = AgentRun(name="safe_query", start_time=datetime.now(UTC))
with run_context(run):
try:
response = query_engine.query(user_query)
except Exception as exc:
# The failed span is already recorded in the run
logger.error("Query failed: %s", exc)
Troubleshooting
Spans Not Captured
Symptom: run.steps is empty after calling LlamaIndex.
Cause: The handler was not registered with Settings.callback_manager before the index or
engine was created, or no AgentRun was active in context.
Fix: Register the handler before building any index, and always use run_context:
# Register handler BEFORE creating indices or query engines
handler = TraceCraftSpanHandler()
Settings.callback_manager = CallbackManager(handlers=[handler])
# Build your index after registration
index = VectorStoreIndex.from_documents(documents)
# Wrap calls in a run context
run = AgentRun(name="my_run", start_time=datetime.now(UTC))
with run_context(run):
response = index.as_query_engine().query("Hello")
Duplicate Traces
Symptom: Each query produces two identical runs.
Cause: Multiple handlers were attached, or tracecraft.init() was called more than once.
Fix: Create the handler and call tracecraft.init() once per process:
# In your application entry point — call these once only
tracecraft.init()
handler = TraceCraftSpanHandler()
Settings.callback_manager = CallbackManager(handlers=[handler])
Missing Token Counts
Symptom: step.input_tokens and step.output_tokens are None for LLM steps.
Cause: The LLM provider does not populate response.raw["usage"].
Fix: Verify that your LLM returns usage data. For OpenAI:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini")
response = llm.complete("Hello")
# Should contain {"usage": {"prompt_tokens": ..., "completion_tokens": ...}}
print(response.raw)
Memory Issues in Long-Running Servers
Symptom: Memory usage grows without bound over time.
Cause: handler.clear() is not called between requests, leaving completed spans in the
handler’s internal dictionary.
Fix: Always clear the handler after each request, using try/finally:
from tracecraft.core.context import run_context
from tracecraft.core.models import AgentRun
from datetime import UTC, datetime
async def handle_request(query: str) -> str:
run = AgentRun(name="request", start_time=datetime.now(UTC))
try:
with run_context(run):
return str(query_engine.query(query))
finally:
handler.clear()