# AbstractCore - llms-full (Self-Contained Agent Handbook)

> This is the "full context" companion to `llms.txt`: a single, self-contained guide for agents and developers using AbstractCore. It intentionally avoids GitHub link hubs; only a few external references are included (API-key signup pages).

Last updated: 2026-06-04
Package version: 2.13.35
Supported Python: 3.9+; GitHub CI tests 3.9, 3.10, 3.11, 3.12, and 3.13.

## How to use this file (agents)

- Treat this as the primary context for answering questions or making changes.
- Prefer the public docs in this repo as the source of truth; keep behavior claims consistent with `docs/`.
- Keep the default install lightweight: heavy/optional dependencies must stay behind extras and be imported lazily.
- Default tool execution is **pass-through**: AbstractCore returns tool calls; the host/runtime executes them.
- In the AbstractFramework ecosystem, **AbstractRuntime** is the recommended runtime for executing tool calls durably (policy, retries, persistence): https://github.com/lpalbou/abstractruntime
- Media handling is **policy-driven** by design (no silent semantic changes). If audio/video/images "don't work", check policy + configured fallbacks.
- Framework defaults use capability routes such as canonical `input.text`, fallback `input.image`, explicit `input.voice` and `input.video`, `embedding.text`, and future `rerank.text`; Core owns persistence and Gateway can act as a control plane. `output.text` is a read-only view of `input.text`.
- `llms.txt` is the short index; this file is intentionally large. Prefer pulling only the sections you need.

## Table of contents

1. Quick start (2 minutes)
2. What AbstractCore is (and isn't)
3. Installation + extras (what to `pip install`)
4. Providers (IDs, env vars, and examples)
5. Core Python API patterns (generate/stream/async/sessions)
6. Tool calling (agentic workflows)
7. Structured output (`response_model=...`)
8. Media handling (images/audio/video + documents)
9. Embeddings
10. CLI + centralized config (`abstractcore --config`)
11. Server + endpoint (OpenAI-compatible `/v1`)
12. Repo map + contribution workflow
13. Troubleshooting checklist

---

## 1) Quick start (2 minutes)

### Install

```bash
pip install abstractcore
```

If you're using a cloud provider SDK, install only what you need:

```bash
pip install "abstractcore[openai]"      # OpenAI Python SDK
pip install "abstractcore[anthropic]"   # Anthropic Python SDK
```

### Minimal first call

```python
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")  # requires: abstractcore[openai]
resp = llm.generate("Say hello in French.")
print(resp.content)
```

### If you're using a local OpenAI-compatible server

```python
from abstractcore import create_llm

llm = create_llm("openai-compatible", model="default", base_url="http://localhost:1234/v1")
print(llm.generate("Hello!").content)
```

Important: most OpenAI-compatible servers expect the base URL to include `/v1`.

### Examples (runnable tour)

Start with the curated examples index:
- `examples/README.md`

Recommended guided path (read + run in order):
```bash
python examples/learning_path/01_basic_generation.py
python examples/learning_path/02_provider_configuration.py
python examples/learning_path/03_tool_calling.py
python examples/learning_path/04_unified_streaming.py
python examples/learning_path/05_server_agentic_cli.py
python examples/learning_path/06_production_patterns.py
```

---

## 2) What AbstractCore is (and isn't)

### What it is

AbstractCore is a unified Python interface over multiple LLM backends (cloud + local), with consistent support for:

- **Streaming** (`stream=True`)
- **Tool calling** (`@tool`), with a universal tool representation across providers
- **Structured output** via Pydantic (`response_model=...`)
- **Media handling** for images/audio/video/documents via an explicit, policy-driven system
- **Embeddings** (optional)
- **Durable memory bloc cache artifacts** for exact local text/file prefix reuse across MLX,
  HuggingFace Transformers, and supported HuggingFace GGUF exact-renderer paths
- **An optional OpenAI-compatible HTTP server** (`/v1/chat/completions`, plus optional images/audio endpoints via plugins)

### What it is not

- It is not "one giant dependency": the default install is intentionally small.
- It is not "magic multimodal": attaching audio/video/images has explicit policies and optional fallbacks.
- It is not a hosted service: local providers and servers are your responsibility to run and secure.

### Core design invariant: lightweight default install

`pip install abstractcore` should:
- install quickly,
- import cleanly,
- not pull heavyweight deps (torch/transformers/PDF pipelines/server deps).

Anything heavy must be behind install extras and imported lazily inside the code paths that need it.

---

## 3) Installation + extras (what to `pip install`)

### Core

```bash
pip install abstractcore
```

Core includes local HTTP/gateway providers that need no SDK: `ollama`,
`lmstudio`, `openrouter`, `portkey`, and generic `openai-compatible` `/v1`
endpoints.

### Hosted SDK / provider extras (install only what you use)

```bash
pip install "abstractcore[remote]"      # OpenAI + Anthropic SDKs
pip install "abstractcore[openai]"
pip install "abstractcore[anthropic]"
pip install "abstractcore[huggingface]"  # transformers/torch (heavy)
pip install "abstractcore[apple]"        # Apple Silicon local LLM stack (alias of mlx; heavy)
pip install "abstractcore[gpu]"          # GPU local LLM stack (alias of vllm; heavy)
pip install "abstractcore[mlx]"          # Explicit MLX provider extra
pip install "abstractcore[vllm]"         # Explicit vLLM provider extra
```

Notes:
- `ollama`, `lmstudio`, `openrouter`, `portkey`, and `openai-compatible` use only core deps (they speak HTTP).
- `openrouter`, `portkey`, and `openai-compatible` also have explicit no-dependency extras for clarity/composability.

### Optional feature extras

```bash
pip install "abstractcore[tools]"        # built-in web + filesystem helper tools
pip install "abstractcore[media]"        # images + PDF/Office extraction
pip install "abstractcore[compression]"  # glyph visual-text compression
pip install "abstractcore[embeddings]"   # EmbeddingManager + local embedding models
pip install "abstractcore[tokens]"       # precise token counting (tiktoken)
pip install "abstractcore[voice]"        # remote-light abstractvoice TTS/STT capability
pip install "abstractcore[vision]"       # remote-light abstractvision capability
pip install "abstractcore[music]"        # remote-light abstractmusic capability
pip install "abstractcore[server]"       # OpenAI-compatible /v1 HTTP gateway (FastAPI)
```

Compatibility note:
- `abstractcore[tool]` is accepted as an alias of `abstractcore[tools]`.

### Turnkey installs (pick one)

```bash
pip install "abstractcore[all-apple]"    # Apple Silicon: remote SDKs + HF/GGUF + MLX + features + server
pip install "abstractcore[all-gpu]"      # NVIDIA GPU: remote SDKs + HF/GGUF + vLLM + features + server
```

Install-profile note:
`apple`/`gpu` are hardware-profile aliases for the local LLM engine stack.
Capability extras such as `voice`, `audio`, `vision`, and `music` install the
lightweight plugin paths used for remote-capable routing. `all-apple`/`all-gpu`
are larger aggregate profiles for a full local-development environment,
including local plugin engines where supported.

Shell tip:
- In zsh, always quote extras: `pip install "abstractcore[media]"`.

Legacy note:
- `all-non-mlx` remains as a legacy broad bundle, but new installs should usually compose explicit extras or choose `all-apple` / `all-gpu`.

---

## 4) Providers (IDs, env vars, and examples)

AbstractCore uses a **provider ID** plus a **model name**:

```python
from abstractcore import create_llm
llm = create_llm("openai", model="gpt-4o-mini")
```

### Provider ID list (common)

- Cloud: `openai`, `anthropic`
- Gateways (OpenAI-compatible routing): `openrouter`, `portkey`
- Local/self-hosted (OpenAI-compatible HTTP): `ollama`, `lmstudio`, `vllm`, `openai-compatible`
- Local in-process: `mlx`, `huggingface` (require heavy extras)

### Environment variables (quick map)

Cloud / gateways:
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `OPENROUTER_API_KEY` (optional: `OPENROUTER_BASE_URL`, `OPENROUTER_SITE_URL`, `OPENROUTER_APP_NAME`)
- `PORTKEY_API_KEY` (routing: `PORTKEY_CONFIG` or `PORTKEY_VIRTUAL_KEY`; provider-direct: `PORTKEY_PROVIDER` + `PORTKEY_PROVIDER_API_KEY`; optional: `PORTKEY_BASE_URL`)

OpenAI-compatible base URLs (local/self-hosted):
- `OLLAMA_BASE_URL` (legacy: `OLLAMA_HOST`)
- `LMSTUDIO_BASE_URL`
- `VLLM_BASE_URL`
- `OPENAI_BASE_URL` (generic provider)

OpenAI-compatible optional auth:
- `OPENAI_API_KEY`

HuggingFace caching (optional):
- `HF_HOME` (or rely on defaults)

### 4.1 OpenAI (`openai`)

Install:

```bash
pip install "abstractcore[openai]"
export OPENAI_API_KEY="sk-..."
```

Use:

```python
from abstractcore import create_llm
llm = create_llm("openai", model="gpt-4o-mini")
print(llm.generate("Give me 3 bullet points about HTTP caching.").content)
```

### 4.2 Anthropic (`anthropic`)

Install:

```bash
pip install "abstractcore[anthropic]"
export ANTHROPIC_API_KEY="sk-ant-..."
```

Use:

```python
from abstractcore import create_llm
llm = create_llm("anthropic", model="claude-haiku-4-5")
print(llm.generate("Write a haiku about distributed systems.").content)
```

### 4.3 OpenRouter gateway (`openrouter`)

OpenRouter is an OpenAI-compatible gateway/aggregator.

Setup:

```bash
export OPENROUTER_API_KEY="sk-or-..."
# Optional override (default: https://openrouter.ai/api/v1)
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
```

Optional analytics headers:

```bash
export OPENROUTER_SITE_URL="https://your-site.example"
export OPENROUTER_APP_NAME="YourAppName"
```

Use:

```python
from abstractcore import create_llm

llm = create_llm("openrouter", model="openai/gpt-4o-mini")
print(llm.generate("Say hello in Japanese.").content)
```

### 4.4 Portkey gateway (`portkey`)

Portkey is an OpenAI-compatible AI gateway that routes requests via headers.

Setup (most common: config routing):

```bash
export PORTKEY_API_KEY="pk_..."
export PORTKEY_CONFIG="pcfg_..."  # config id
# Optional override (default: https://api.portkey.ai/v1)
export PORTKEY_BASE_URL="https://api.portkey.ai/v1"
```

Use (config mode):

```python
from abstractcore import create_llm

llm = create_llm("portkey", model="gpt-4o-mini", config_id="pcfg_...")
print(llm.generate("Say hello in French.").content)
```

Portkey routing modes (pick one; don't mix):
- **Config mode**: `PORTKEY_CONFIG` or `config_id=...` -> sends `x-portkey-config`
- **Virtual-key mode**: `PORTKEY_VIRTUAL_KEY` or `virtual_key=...` -> sends `x-portkey-virtual-key`
- **Provider-direct mode**: `PORTKEY_PROVIDER` / `portkey_provider=...` + `PORTKEY_PROVIDER_API_KEY` / `provider_api_key=...` -> sends `x-portkey-provider` + backend auth
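
The "pick one; don't mix" rule can be checked before any request is sent. The following is a minimal sketch, not AbstractCore's implementation: `portkey_routing_headers` is a hypothetical helper, and only the routing header names come from the list above. The backend auth header used in provider-direct mode is deliberately left out because its name is not stated here.

```python
from typing import Dict, Optional


def portkey_routing_headers(
    config_id: Optional[str] = None,
    virtual_key: Optional[str] = None,
    provider: Optional[str] = None,
) -> Dict[str, str]:
    """Return the routing header for exactly one Portkey mode (hypothetical helper)."""
    candidates = {
        "x-portkey-config": config_id,
        "x-portkey-virtual-key": virtual_key,
        "x-portkey-provider": provider,
    }
    # Keep only the modes the caller actually supplied.
    chosen = {name: value for name, value in candidates.items() if value}
    if len(chosen) != 1:
        supplied = ", ".join(sorted(chosen)) or "none"
        raise ValueError(f"Pick exactly one Portkey routing mode; got: {supplied}")
    return chosen
```

Failing fast on mixed modes keeps a misconfigured environment from silently routing through the wrong backend.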

Gateway parameter safety:
- Gateways forward your payload to a routed backend model.
- To avoid sending defaults that strict models reject, AbstractCore's gateway providers forward optional generation parameters (like `temperature`, `top_p`, `max_output_tokens`) **only when you explicitly set them**.
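
The "only when you explicitly set them" rule amounts to omitting unset keys from the payload instead of sending defaults. A minimal sketch, assuming `None` means "not set"; `build_gateway_payload` is a hypothetical helper and the payload key names are illustrative, not the exact wire format AbstractCore uses:

```python
from typing import Any, Dict, List, Optional


def build_gateway_payload(
    model: str,
    messages: List[Dict[str, str]],
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    max_output_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a request payload that carries only explicitly set options."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
    }
    # Compare against None (not truthiness) so an explicit 0.0 is still sent.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

Note the `is not None` check: `temperature=0.0` is an explicit choice and must survive, which a plain truthiness test would drop.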

### 4.5 Generic OpenAI-compatible (`openai-compatible`)

Best for: any OpenAI-compatible `/v1` endpoint (llama.cpp servers, LocalAI, text-generation-webui, custom proxies).

Setup:

```bash
export OPENAI_BASE_URL="http://localhost:1234/v1"
# Optional (if your endpoint requires auth)
export OPENAI_API_KEY="your-api-key"
```

Use:

```python
from abstractcore import create_llm

llm = create_llm("openai-compatible", model="default", base_url="http://localhost:1234/v1")
print(llm.generate('Give me 3 synonyms for "fast".').content)
```

### 4.6 Ollama (`ollama`)

Ollama runs a local HTTP server. Typical base URL: `http://localhost:11434`.

```python
from abstractcore import create_llm

llm = create_llm("ollama", model="qwen3:4b-instruct-2507-q4_K_M", base_url="http://localhost:11434")
print(llm.generate("Explain what a mutex is.").content)
```

### 4.7 LM Studio (`lmstudio`)

LM Studio's OpenAI-compatible base URL commonly ends with `/v1` (example: `http://localhost:1234/v1`).

```python
from abstractcore import create_llm

llm = create_llm("lmstudio", model="qwen/qwen3-4b-2507", base_url="http://localhost:1234/v1")
print(llm.generate("Write a one-line joke about compilers.").content)
```

### 4.8 vLLM (`vllm`)

vLLM is a GPU inference server (NVIDIA CUDA only). Typical base URL: `http://localhost:8000/v1`.

```python
from abstractcore import create_llm

llm = create_llm("vllm", model="Qwen/Qwen3-Coder-30B-A3B-Instruct", base_url="http://localhost:8000/v1")
print(llm.generate("Write a Python function that reverses a list.").content)
```

### 4.9 MLX (`mlx`)

MLX runs in-process on Apple Silicon and requires a heavy extra:

```bash
pip install "abstractcore[mlx]"
```

```python
from abstractcore import create_llm

llm = create_llm("mlx", model="mlx-community/Qwen3-4B")
print(llm.generate("Summarize the CAP theorem.").content)
```

### 4.10 HuggingFace (`huggingface`)

HuggingFace runs in-process and requires transformers/torch:

```bash
pip install "abstractcore[huggingface]"
```

Quantized Transformers checkpoints such as AWQ, GPTQ, bitsandbytes, and
compressed-tensors may require optional quantization runtimes beyond the
baseline HuggingFace extra. This is a general model-load compatibility issue,
not a prompt-cache issue. If a model reports missing base weights, unexpected
packed weights, or produces nonsense for a trivial prompt, choose a compatible
runtime/model pair or use the provider-native MLX/GGUF path. See
`docs/huggingface-model-compatibility.md`.

```python
from abstractcore import create_llm

llm = create_llm("huggingface", model="unsloth/Qwen3-4B-Instruct-2507-GGUF")
print(llm.generate("Explain what RAG is in 3 bullets.").content)
```

---

## 5) Core Python API patterns (generate/stream/async/sessions)

### `create_llm(provider, model=..., **kwargs)`

```python
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini", temperature=0.2)
```

Common kwargs (best-effort across providers):
- `temperature`
- `top_p`
- `top_k`
- `seed`
- `thinking` (`None|"auto"|"on"|"off"|"none"|True|False|"minimal"|"low"|"medium"|"high"|"xhigh"`)

Generation defaults from model metadata, architecture metadata, or loaded HF `generation_config.json` are defaults only. Constructor kwargs and per-call `generate(..., temperature=..., top_p=..., top_k=...)` values override those defaults whenever the target provider/backend accepts the parameter.

### `generate(prompt_or_messages, ...)`

```python
resp = llm.generate("Hello!")
print(resp.content)
print(resp.usage)      # provider-dependent
print(resp.tool_calls) # pass-through by default
print(resp.metadata)   # provider/model specific (e.g., normalized reasoning channel)
```

### Streaming

```python
for chunk in llm.generate("Write a short poem.", stream=True):
    print(chunk.content or "", end="", flush=True)
```
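
If you also need the full text after streaming, accumulate the chunks as they arrive. A minimal sketch that uses a stand-in chunk type so it runs without a provider; `FakeChunk` and `collect_stream` are hypothetical names, and the only assumption about real chunks is the `content` attribute used above, which may be `None`:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeChunk:
    """Stand-in for a streamed chunk; only `content` is modeled."""
    content: Optional[str]


def collect_stream(chunks: Iterable[FakeChunk]) -> str:
    """Join streamed chunk contents, treating missing content as empty text."""
    parts = []
    for chunk in chunks:
        parts.append(chunk.content or "")
    return "".join(parts)
```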

### Async

```python
import asyncio
from abstractcore import create_llm

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")
    resp = await llm.agenerate("Give me 3 bullet points about HTTP/3.")
    print(resp.content)

asyncio.run(main())
```

### Sessions (`BasicSession`)

Use sessions to keep conversation state and shared defaults:

```python
from abstractcore import BasicSession, create_llm

session = BasicSession(create_llm("anthropic", model="claude-haiku-4-5"), temperature=0.3)
print(session.generate("Give me 3 startup name ideas.").content)
print(session.generate("Pick the best one and explain why.").content)
```

### Prompt caching sessions (`CachedSession`)

`CachedSession` is `BasicSession` plus best-effort **provider prompt caching** (reusing stable prefixes and/or sending only deltas when supported).

```python
from abstractcore import CachedSession, create_llm

llm = create_llm("mlx", model="mlx-community/Qwen3-4B")  # requires: abstractcore[mlx]
session = CachedSession(llm, system_prompt="You are a helpful assistant.", prompt_cache_strategy="auto")

# Attach local text/doc files as stable transcript "boxes" (1 file = 1 message).
session.attach_files(["/path/to/A.md", "/path/to/B.md"])

print(session.generate("What’s the key difference between A and B?").content)
```

See: `docs/prompt-caching.md` and `examples/prompt_caching/README.md`.

---

## 6) Tool calling (agentic workflows)

### Define tools with `@tool`

```python
from abstractcore import tool

@tool
def get_weather(city: str) -> str:
    """Return a short weather string for a city."""
    return f"{city}: 22C and sunny"
```

### Pass-through is the default

By default, AbstractCore does **not** execute tools. Instead:
- the model emits tool calls,
- AbstractCore parses/normalizes them,
- your host/runtime executes them (or ignores them),
- you feed the tool results back in if you want.

```python
from abstractcore import create_llm, tool

@tool
def add(a: int, b: int) -> int:
    return a + b

llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate("Use the add tool to compute 2+3.", tools=[add])

print(resp.tool_calls)  # structured calls for the host to run
```
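
On the host side, executing pass-through calls is a lookup-and-dispatch loop. The sketch below is one way to do it, not AbstractCore's runtime: it assumes each tool call can be viewed as a dict with `name` and `arguments` keys (the real objects in `resp.tool_calls` may have a different shape), and `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, List


def add(a: int, b: int) -> int:
    return a + b


# Host-owned mapping from tool name to the Python callable that implements it.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"add": add}


def run_tool_calls(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Execute each call and return one result record per call."""
    results = []
    for call in tool_calls:
        name = call["name"]
        arguments = call.get("arguments") or {}
        if isinstance(arguments, str):
            # Some backends deliver arguments as a JSON string.
            arguments = json.loads(arguments)
        func = TOOL_REGISTRY.get(name)
        if func is None:
            results.append({"name": name, "error": f"unknown tool: {name}"})
            continue
        try:
            results.append({"name": name, "output": func(**arguments)})
        except Exception as exc:  # report the failure instead of crashing the loop
            results.append({"name": name, "error": str(exc)})
    return results
```

Returning errors as data lets the host feed them back to the model as tool results, so the model can correct its own call.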

Hybrid note (tools + structured output):
- If you pass both `tools=[...]` and `response_model=...` in one `generate()` call, AbstractCore uses a 2-pass hybrid flow: first a tool-capable call, then a structured-output call.
- Streaming is not supported in this hybrid mode.

### Built-in tools (optional)

Install:

```bash
pip install "abstractcore[tools]"
```

Then import from `abstractcore.tools.common_tools` (examples):
- `skim_websearch` vs `web_search` (compact vs full search results)
- `skim_url` vs `fetch_url` (fast triage vs full fetch + parsing)

Recommended agent workflow to keep outputs small:

1. `skim_websearch` to get a shortlist
2. `skim_url` to validate which links are worth opening
3. `fetch_url` only for the final few sources (use `include_full_content=False` when you need a smaller result)

### Tool-call syntax rewriting (preserve markup)

Some agent runtimes want tool calls preserved in `response.content` using custom tags.

- Python: pass `tool_call_tags=...` to `generate()` / `agenerate()`
- Server: set `agent_format` in requests

The docs call this "tool syntax rewriting"; it is designed to keep tool-call markup stable across providers.

---

## 7) Structured output (`response_model=...`)

Structured output turns "return JSON" prompts into typed objects.

```python
from pydantic import BaseModel
from abstractcore import create_llm

class Answer(BaseModel):
    title: str
    bullets: list[str]

llm = create_llm("openai", model="gpt-4o-mini")
result = llm.generate("Summarize HTTP/3 in 3 bullets.", response_model=Answer)
print(result.title)
print(result.bullets)
```

How it works (high level):
- When the provider supports native structured output, AbstractCore uses it.
- Otherwise, it uses prompted strategies plus validation/retry to produce a valid object.
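
The prompted fallback path can be pictured as a parse, validate, retry loop. This is a stdlib-only sketch of that general technique, not AbstractCore's actual strategy: `call_model` is a hypothetical callable standing in for a provider call, a dataclass stands in for the Pydantic model, and the retry prompt wording is invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Answer:
    title: str
    bullets: List[str]


def parse_answer(raw: str) -> Answer:
    """Parse and validate raw model text; raise ValueError on any mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    title, bullets = data.get("title"), data.get("bullets")
    if not isinstance(title, str):
        raise ValueError("title must be a string")
    if not isinstance(bullets, list) or not all(isinstance(b, str) for b in bullets):
        raise ValueError("bullets must be a list of strings")
    return Answer(title=title, bullets=bullets)


def generate_validated(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Answer:
    """Call the model until its reply validates, feeding errors back each time."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            return parse_answer(raw)
        except ValueError as exc:
            last_error = exc
            # Tell the model what went wrong so the next attempt can fix it.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({exc}). Reply with JSON only."
            )
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")
```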

Practical tips:
- Keep schemas small and unambiguous.
- If validation fails, check the error and simplify the schema or give the model a clearer extraction instruction.

---

## 8) Media handling (images/audio/video + documents)

Media is opt-in and policy-driven to avoid silent semantic changes.

### Installation

```bash
pip install "abstractcore[media]"
```

### Attach media

```python
from abstractcore import create_llm

llm = create_llm("anthropic", model="claude-haiku-4-5")
resp = llm.generate("Describe the image.", media=["./image.png"])
print(resp.content)
```

### Policies (important)

Audio and video input are controlled by explicit policies:
- `audio_policy`: `native_only|speech_to_text|auto|caption`
- `video_policy`: `native_only|frames_caption|auto`

Defaults are explicit. Native model support may handle the media directly, but
text-only fallback requires configured capability routes such as `input.voice`
for STT or `input.video` for a dedicated video/VLM route.

### Vision fallback (for text-only main models)

If your main model is text-only, configure image/video fallback through
capability routes. `input.image` can be covered by a vision-capable
`input.text` route; `input.video` can also be covered but remains overrideable.

Config CLI examples:

```bash
abstractcore --set-vision-provider huggingface Salesforce/blip-image-captioning-base
abstractcore --add-vision-fallback lmstudio qwen/qwen3-vl-4b
abstractcore --disable-vision
```

### Video fallback requirements

Frame sampling requires `ffmpeg` and `ffprobe` available on `PATH`.
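
A quick preflight check for those binaries, using only the standard library. `missing_binaries` is a hypothetical helper, not part of AbstractCore:

```python
import shutil
from typing import Iterable, List


def missing_binaries(names: Iterable[str] = ("ffmpeg", "ffprobe")) -> List[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

An empty list means both tools are discoverable from the current process environment, which can differ from your interactive shell (for example inside Docker or a service unit).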

Helpful defaults:

```bash
abstractcore --set-video-strategy auto
abstractcore --set-video-max-frames 6
abstractcore --set-video-sampling-strategy keyframes
abstractcore --set-video-max-frame-side 1024
```

### Audio fallback requirements

Speech-to-text fallback uses the configured `input.voice` capability route:

```bash
pip install abstractvoice
abstractcore config set-default input.voice --provider faster-whisper --model large-v3
abstractcore --set-audio-strategy auto
abstractcore --set-stt-language en
```

### Capability plugins (voice/audio/vision/music)

AbstractCore supports optional capability plugins discovered via Python entry points:
- install `abstractvoice` -> enables `llm.voice` and `llm.audio` (TTS/STT)
- install `abstractvision` -> enables `llm.vision` (generative images/video through OpenAI-compatible endpoints or local backends such as MLX-Gen)
- install `abstractcore[music]` -> enables `llm.music` and music output routing for text-to-music through `abstractmusic`
- current plugin floors in this source: `abstractvoice>=0.10.17`, `abstractvision>=0.3.22`, `abstractmusic>=0.1.13`
- `llm.vision` exposes `t2i`, `i2i`, `upscale_image`, `t2v`, and `i2v`, and supports SeedVR2 image upscaling and typed Wan A14B `guidance_2` video controls
- `generate(..., output=...)` forwards progress callbacks for generated image/video outputs
- SeedVR2 upscaling can be called in two ways:
  - directly: `llm.vision.upscale_image("input.png", provider="mlx-gen", model="AbstractFramework/seedvr2-3b-8bit", scale="2x", on_progress=on_progress)`
  - through the unified output route: `llm.generate(media={"type": "image", "path": "input.png", "role": "source"}, output={"task": "image_upscale", "provider": "mlx-gen", "model": "AbstractFramework/seedvr2-3b-8bit", "scale": "2x"})`
- SeedVR2 model values: canonical q8/q4 SeedVR2 package ids do not need runtime `quantize`; official/source weights can use SeedVR2 runtime `quantize`; prepared local folders can be passed as the model value when available.

Server note:
- the server can optionally expose `/v1/images/*`, `/v1/videos/*`, and `/v1/vision/*` model/control surfaces (requires `abstractvision` for local backends; remote OpenAI-compatible proxy paths can run through the server)
- the server can optionally expose `/v1/audio/*` (requires an audio/voice/music plugin; commonly `abstractvoice` or `abstractmusic`)

### Glyph visual-text compression (optional, experimental)

If you need to squeeze long text into smaller vision-friendly inputs, AbstractCore supports glyph compression.

```bash
pip install "abstractcore[compression]"
```

---

## 9) Embeddings

Install local HuggingFace/sentence-transformer embeddings explicitly:

```bash
pip install "abstractcore[embeddings]"
```

Remote/provider-backed embeddings are also supported. LM Studio, vLLM, and
generic OpenAI-compatible embedding endpoints use the core HTTP client and can
serve embedding-only `/v1/embeddings` routes without requiring the embedding
model to appear in a chat model catalogue. OpenAI embeddings require the
OpenAI SDK extra or another install profile that includes it.

Use:

```python
from abstractcore.embeddings import EmbeddingManager

em = EmbeddingManager()
vec = em.embed_text("hello world")
print(len(vec))
```
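
Embedding vectors are usually compared with cosine similarity. A stdlib-only sketch, assuming the vectors returned by `embed_text` behave like sequences of floats; `cosine_similarity` here is a local helper written for illustration, not an AbstractCore API:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```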

---

## 10) CLI + centralized config (`abstractcore --config`)

AbstractCore stores persistent configuration at:

`~/.abstractcore/config/abstractcore.json`

### Most-used commands

```bash
abstractcore --config
abstractcore --status

abstractcore --set-api-key openai sk-...
abstractcore --set-api-key anthropic sk-ant-...
abstractcore --set-api-key openrouter sk-or-...
abstractcore --set-api-key portkey pk-...
abstractcore --set-api-key openai-compatible endpoint-key
abstractcore --set-api-key vllm endpoint-key

abstractcore --set-server-auth-token acore-server-secret
abstractcore --set-server-base-url-allowlist "https://example.com/v1"
abstractcore --set-server-url-fetch-allowlist "https://files.example.com"
abstractcore --set-server-media-root /srv/abstractcore-media

abstractcore --set-chat-model openai/gpt-4o-mini
abstractcore --set-code-model anthropic/claude-haiku-4-5
```

HTTP server settings persisted here are injected into `ABSTRACTCORE_SERVER_*`,
`HOST`, and `PORT` when those environment variables are not already set. A
server auth token authenticates clients to AbstractCore; provider keys remain
separate upstream credentials.
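
The "only when not already set" behavior matches `dict.setdefault` semantics. A minimal sketch of that rule over plain dicts; `inject_server_settings` is a hypothetical helper, and the mapping from persisted config keys to variable names is not shown because it is not specified here:

```python
from typing import Dict, Mapping


def inject_server_settings(
    env: Mapping[str, str], persisted: Mapping[str, str]
) -> Dict[str, str]:
    """Return a copy of env where persisted values fill only the missing keys."""
    merged = dict(env)
    for key, value in persisted.items():
        merged.setdefault(key, value)  # existing environment values win
    return merged
```

For example, with `PORT` already exported in the shell, a persisted `PORT` is ignored while a persisted `HOST` is still applied.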

### Priority order (high -> low)

1. Explicit parameters (per call / per request)
2. App-specific configuration (CLI apps)
3. Global configuration
4. Hardcoded defaults
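
This layering maps naturally onto `collections.ChainMap`, which searches its maps from first to last. A minimal sketch of the lookup order only; `resolve_settings` and the default values are hypothetical, and AbstractCore's real resolution code may be structured differently:

```python
from collections import ChainMap
from typing import Any, Dict, Mapping

# Illustrative values only; not AbstractCore's real defaults.
HARDCODED_DEFAULTS: Dict[str, Any] = {"temperature": 0.7, "model": "default-model"}


def resolve_settings(
    explicit: Mapping[str, Any],
    app_config: Mapping[str, Any],
    global_config: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge the four layers; earlier layers override later ones."""
    # Drop unset explicit parameters so they do not mask lower layers.
    explicit_set = {key: value for key, value in explicit.items() if value is not None}
    return dict(ChainMap(explicit_set, dict(app_config), dict(global_config), HARDCODED_DEFAULTS))
```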

### Logging controls

```bash
abstractcore --set-console-log-level DEBUG
abstractcore --enable-file-logging
abstractcore --set-log-base-dir ~/.abstractcore/logs
abstractcore --status
```

### Installed console scripts (quick map)

These entrypoints are defined in `pyproject.toml` and are available after install:

- `abstractcore` / `abstractcore-config`: configuration CLI (`--config`, `--status`, `--set-api-key`, defaults)
- `abstractcore-chat`: interactive REPL for chatting with a provider/model
- `abstractcore-endpoint`: single-model OpenAI-compatible endpoint server (local inference hosting)
- Built-in apps: `summarizer`, `extractor`, `judge`, `intent`, `deepsearch` (run each with `--help` for usage)

Examples:

```bash
abstractcore-chat --provider openai --model gpt-4o-mini
summarizer ./document.txt --provider ollama --model gemma3:1b-it-qat
```

---

## 11) Server + endpoint (OpenAI-compatible `/v1`)

The server turns AbstractCore into an OpenAI-compatible API gateway.

### Install + run

```bash
pip install "abstractcore[server]"
abstractcore serve
```

Health check:

```bash
curl http://localhost:8000/health
```

Interactive docs:
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`
- Docs Lite: `http://localhost:8000/docs-lite`

When server auth is enabled, `/docs` keeps Swagger UI's normal `Authorize`
button, but AbstractCore wraps that action and calls `/acore/auth/validate`
before Swagger stores a bearer token for `Try it out` requests. Invalid tokens
are not treated as authenticated.
When server auth is disabled, the server bearer scheme is omitted from the
OpenAPI docs so Swagger does not show a misleading server-token authorize flow.

### Model naming

Requests use `provider/model`:
- `openai/gpt-4o-mini`
- `anthropic/claude-haiku-4-5`
- `ollama/qwen3:4b-instruct-2507-q4_K_M`
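
Because model names can themselves contain slashes (for example `lmstudio/qwen/qwen3-4b-2507`), the provider is the part before the first `/`. A minimal sketch of that split; `split_model_id` is a hypothetical helper and the server's own parsing may handle more cases:

```python
from typing import Tuple


def split_model_id(model_id: str) -> Tuple[str, str]:
    """Split 'provider/model' on the first slash only."""
    provider, separator, model = model_id.partition("/")
    if not separator or not provider or not model:
        raise ValueError(f"expected 'provider/model', got: {model_id!r}")
    return provider, model
```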

### Chat completions request

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ABSTRACTCORE_AUTH_TOKEN" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
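
The same request can be built from Python with only the standard library. This sketch constructs the request object without sending it; `build_chat_request` is a hypothetical helper, and the host and port are the defaults assumed by the examples in this section:

```python
import json
import urllib.request
from typing import Dict, List, Optional


def build_chat_request(
    base_url: str,
    model: str,
    messages: List[Dict[str, str]],
    auth_token: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions."""
    headers = {"Content-Type": "application/json"}
    if auth_token:
        # Only needed when server auth is enabled.
        headers["Authorization"] = f"Bearer {auth_token}"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    url = base_url.rstrip("/") + "/v1/chat/completions"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       reply = json.load(response)
```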

### AbstractCore server extensions (important)

The server supports a few non-OpenAI fields to make multi-provider routing practical:

- `agent_format`: tool-call syntax output format for agentic clients (`auto|openai|codex|qwen3|llama3|gemma|xml|passthrough`)
- `base_url`: per-request provider base URL override (loopback-only by default; non-loopback hosts require `ABSTRACTCORE_SERVER_BASE_URL_ALLOWLIST`)
- `thinking`: unified thinking/reasoning control (`null|"auto"|"on"|"off"|"none"|...`)
- `prompt_cache_key` / `prompt_cache_retention`: best-effort provider prompt-cache controls
- `unload_after`: best-effort model unload after request (dangerous in multi-tenant environments)
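
The `base_url` rule (loopback by default, allowlist otherwise) can be approximated as follows. This is a sketch under stated assumptions, not the server's implementation: `base_url_allowed` is a hypothetical helper, and matching allowlist entries by scheme, host, and port is an assumption about how entries are compared.

```python
import ipaddress
from typing import Iterable, Optional, Tuple
from urllib.parse import urlsplit


def _origin(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


def is_loopback_host(host: str) -> bool:
    """True for 'localhost' and loopback IP literals such as 127.0.0.1 or ::1."""
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an IP literal, so not loopback


def base_url_allowed(base_url: str, allowlist: Iterable[str] = ()) -> bool:
    """Allow loopback hosts always; other hosts only if their origin is allowlisted."""
    host = urlsplit(base_url).hostname
    if not host:
        return False
    if is_loopback_host(host):
        return True
    return _origin(base_url) in {_origin(entry) for entry in allowlist}
```

Comparing parsed origins rather than string prefixes avoids accepting look-alike hosts such as `example.com.evil.test`.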

Provider keys are not accepted in request bodies. Configure provider keys on the
server, send `X-AbstractCore-Provider-API-Key` as a per-request upstream override,
or use `Authorization: Bearer <provider-key>` only when `ABSTRACTCORE_AUTH_TOKEN`
is not configured.
When server auth is configured, `Authorization` is the AbstractCore server auth token and is not forwarded upstream.
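
The header rules above can be summarized as a small decision function. This is a sketch of the described behavior, not the server's code: `resolve_upstream_key` is a hypothetical helper, and the relative priority of a bearer provider key versus a server-configured key is an assumption.

```python
from typing import Mapping, Optional

OVERRIDE_HEADER = "x-abstractcore-provider-api-key"


def resolve_upstream_key(
    headers: Mapping[str, str],
    server_auth_token: Optional[str],
    configured_provider_key: Optional[str],
) -> Optional[str]:
    """Pick the provider key to send upstream for one request."""
    lowered = {name.lower(): value for name, value in headers.items()}
    override = lowered.get(OVERRIDE_HEADER)
    if override:
        return override  # explicit per-request upstream override
    authorization = lowered.get("authorization", "")
    if not server_auth_token and authorization.lower().startswith("bearer "):
        # Without server auth, the bearer value is treated as a provider key.
        return authorization[len("bearer "):].strip()
    # With server auth configured, Authorization is never forwarded upstream.
    return configured_provider_key
```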

### Provider base_url override example

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ABSTRACTCORE_AUTH_TOKEN" \
  -d '{
    "model": "lmstudio/qwen/qwen3-4b-2507",
    "base_url": "http://localhost:1234/v1",
    "messages": [{"role": "user", "content": "Hello from LM Studio"}]
  }'
```

### Optional images/audio endpoints

- Images/video: `POST /v1/images/generations`, `POST /v1/images/edits`, `POST /v1/images/upscale`, `POST /v1/videos/generations`, `POST /v1/videos/edits`, and async `/v1/vision/jobs/images/*` / `/v1/vision/jobs/videos/*`
  - requires `abstractvision` for local backends
  - global routes accept `model`, optional `provider`, and optional `base_url`
  - image edits accept repeatable multipart `reference_images`
  - exact MLX-Gen ids such as `mlx-gen/AbstractFramework/qwen-image-2512-4bit`, `mlx-gen/AbstractFramework/seedvr2-3b-8bit`, `mlx-gen/AbstractFramework/seedvr2-7b-8bit`, `mlx-gen/AbstractFramework/wan2.2-t2v-a14b-diffusers-8bit`, and `mlx-gen/AbstractFramework/wan2.2-i2v-a14b-diffusers-8bit` are valid when cached
  - SeedVR2 upscale routes accept `scale` or `resolution` plus optional runtime `quantize`
- Audio: `POST /v1/audio/transcriptions`, `POST /v1/audio/speech`, `POST /v1/voice/clone` (requires an audio/voice plugin; commonly `abstractvoice`; global routes accept `model`, optional `provider`, and optional `base_url`)

If a required plugin/backend is missing, the server returns `501` with actionable messaging.

### Single-model endpoint (one provider/model per worker)

If you want to host **one** provider+model as a dedicated OpenAI-compatible `/v1` endpoint (no `provider/model` routing), use `abstractcore-endpoint`:

```bash
pip install "abstractcore[server]"
abstractcore-endpoint --provider mlx --model mlx-community/Qwen3-4B --host 0.0.0.0 --port 8001
```

Config via env vars (alternative):
- `ABSTRACTENDPOINT_PROVIDER`
- `ABSTRACTENDPOINT_MODEL`
- `ABSTRACTENDPOINT_HOST`
- `ABSTRACTENDPOINT_PORT`

See `docs/endpoint.md` and `abstractcore/endpoint/app.py`.

---

## 12) Repo map + contribution workflow

### Where things live

Core API:
- `abstractcore/core/` (interfaces, sessions, types, factory)
- `abstractcore/core/factory.py` (`create_llm(...)`)

Providers:
- `abstractcore/providers/` (implementations)
- `abstractcore/providers/base.py` (shared provider logic: tools/media/structured output)
- `abstractcore/providers/registry.py` (single source of truth for provider IDs/metadata)

Tools:
- `abstractcore/tools/` (tool system: decorator, registry, parsing, handler)
- `abstractcore/tools/common_tools.py` (built-in tools; requires `abstractcore[tools]`)

Media + capabilities:
- `abstractcore/media/` (media pipeline; requires `abstractcore[media]`)
- `abstractcore/capabilities/` (capability registry + proxies)

Server:
- `abstractcore/server/app.py` (multi-provider `/v1` gateway)
- `abstractcore/endpoint/app.py` (single-model endpoint)

Docs:
- `docs/` (canonical docs)

Examples:
- `examples/` (runnable tutorials; start at `examples/README.md`)

Tests:
- `tests/` (pytest)

### Dev commands

```bash
pip install -e ".[dev,test]"
pytest -q
black .
ruff check .
```

### Rules for optional dependencies (must-follow)

1. Don't import optional deps in default import paths (especially `abstractcore/__init__.py`).
2. Use lazy imports inside functions/methods where optional deps are used.
3. When a dep is missing, raise a clear error telling users which extra to install, e.g. `pip install "abstractcore[media]"`.
4. Keep features modular (tools/media/embeddings/compression/server) so core stays small.
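
Rules 2 and 3 can be sketched as a small helper. This is a minimal illustration, not AbstractCore's actual implementation: `require_optional` and `open_image` are hypothetical names, and only the install-hint format is taken from the rule above.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency lazily, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f'Install it with: pip install "abstractcore[{extra}]"'
        ) from exc


def open_image(path: str):
    # The optional import happens only when the feature is used,
    # so importing the enclosing module never pulls in Pillow.
    pil_image = require_optional("PIL.Image", "media")
    return pil_image.open(path)
```

Because the import lives inside `open_image`, a default install can still import the module that defines it; the error only appears when the feature is actually called.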

### Adding a provider (checklist)

1. Implement provider in `abstractcore/providers/`.
2. Register it in `abstractcore/providers/registry.py` (ID, defaults, supported features, install hints).
3. Ensure default install still imports cleanly.
4. Add tests under `tests/` (unit tests that don't require real API keys when possible).
5. Update docs (at minimum: prerequisites + API + FAQ) and add a `CHANGELOG.md` entry.

---

## 13) Troubleshooting checklist

### Unsupported parameter errors (temperature/max_tokens/etc.)

Common with gateways and strict model families:
- Gateways forward your payload to a routed backend model.
- AbstractCore gateway providers only send optional generation parameters when explicitly set.
- If you still see errors, check that your gateway config isn't injecting forbidden parameters.

### Local OpenAI-compatible servers not working

- Ensure your base URL includes `/v1` (LM Studio, vLLM, many proxies).
- Confirm the server is reachable from your process (Docker networking is a common gotcha).
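
If base URLs come from user config, a small normalizer can catch the missing `/v1` before any request is made. The sketch below is an assumption-laden convenience (the helper name `ensure_v1` is hypothetical); it only appends `/v1` when the path does not already end with it.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_v1(base_url: str) -> str:
    """Return base_url with exactly one trailing /v1 path segment."""
    parts = urlsplit(base_url.strip())
    path = parts.path.rstrip("/")
    if not path.endswith("/v1"):
        path += "/v1"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(ensure_v1("http://localhost:1234"))
print(ensure_v1("http://localhost:1234/v1/"))
```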

### Missing dependency errors

Typical fixes:
- Tools: `pip install "abstractcore[tools]"`
- Media/doc extraction: `pip install "abstractcore[media]"`
- Embeddings: `pip install "abstractcore[embeddings]"`
- Server: `pip install "abstractcore[server]"`

### Tools aren't executed

Expected default: pass-through. Execute tool calls in your host/runtime, or explicitly opt into an execution path if your app requires it.
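
A host-side execution loop might look like the sketch below. The tool-call shape used here (a dict with `name` and `arguments`) is an assumption for illustration only, not AbstractCore's documented type; adapt the field access to whatever `resp.tool_calls` actually contains. `run_tool_calls` is a hypothetical helper.

```python
from typing import Any, Callable, Dict, Iterable, List


def run_tool_calls(
    tool_calls: Iterable[Dict[str, Any]],
    registry: Dict[str, Callable[..., Any]],
) -> List[Dict[str, Any]]:
    """Execute pass-through tool calls on the host side.

    Assumed shape (illustrative only): {"name": str, "arguments": dict}.
    """
    results = []
    for call in tool_calls:
        name = call["name"]
        func = registry.get(name)
        if func is None:
            # Unknown tools are reported rather than silently dropped.
            results.append({"name": name, "error": f"unknown tool: {name}"})
            continue
        try:
            output = func(**call.get("arguments", {}))
            results.append({"name": name, "output": output})
        except Exception as exc:  # surface tool failures to the caller
            results.append({"name": name, "error": str(exc)})
    return results


def get_weather(city: str) -> str:
    return f"{city}: 22°C and sunny"


results = run_tool_calls(
    [{"name": "get_weather", "arguments": {"city": "Paris"}}],
    {"get_weather": get_weather},
)
print(results)
```

Keeping the registry explicit means the host decides which tools are callable, which matches the pass-through default.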

### Media attachments fail

- Images/docs: install `abstractcore[media]`.
- Video frames fallback: install `abstractcore[media]` and have `ffmpeg`/`ffprobe` available.
- Audio STT fallback: install `abstractvoice` and set `audio_policy="auto"` (or configure via `abstractcore --set-audio-strategy auto`).
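
For the video fallback, a quick preflight check can confirm that both binaries are discoverable before you debug anything else. This is a minimal sketch using the standard library; `missing_binaries` is a hypothetical helper, and the `which` parameter exists only so the check can be exercised without touching the real `PATH`.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_binaries(
    names: Iterable[str] = ("ffmpeg", "ffprobe"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if which(name) is None]


missing = missing_binaries()
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe are available")
```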

---

## Appendix A) Inlined canonical docs snapshot

> Generated from the current files in `docs/` on 2026-05-04. The source code remains the final truth for behavior; this appendix is here so agents can answer usage questions without following links.

---

### Inlined: `docs/getting-started.md`

# Getting Started

AbstractCore is a unified Python interface for cloud, gateway, and local LLM providers. The default install is lightweight; add only the extras your application needs.

## Prerequisites

- Python 3.9+
- `pip`

## Installation

Extras compose. For example, `abstractcore[remote,media,tools]` installs hosted
API SDKs plus document/media handling and built-in tools in one command.

```bash
# Core: local HTTP servers and gateways that need no SDK
# Includes Ollama, LM Studio, OpenRouter, Portkey, and OpenAI-compatible /v1 endpoints
pip install abstractcore

# Hosted API SDKs (OpenAI + Anthropic). OpenRouter/Portkey still work from core.
pip install "abstractcore[remote]"

# Individual provider SDKs / local runtimes
pip install "abstractcore[openai]"       # OpenAI SDK
pip install "abstractcore[anthropic]"    # Anthropic SDK
pip install "abstractcore[huggingface]"  # Transformers / torch (heavy)
pip install "abstractcore[apple]"        # Apple Silicon local LLM stack (alias of mlx; heavy)
pip install "abstractcore[gpu]"          # GPU local LLM stack (alias of vllm; heavy)
pip install "abstractcore[mlx]"          # Explicit MLX provider extra
pip install "abstractcore[vllm]"         # Explicit vLLM provider extra

# Optional features
pip install "abstractcore[tools]"        # built-in tools (web/file/command helpers)
pip install "abstractcore[media]"        # images, PDFs, Office docs
pip install "abstractcore[compression]"  # Glyph visual-text compression (Pillow renderer)
pip install "abstractcore[embeddings]"   # EmbeddingManager + local embedding models
pip install "abstractcore[tokens]"       # precise token counting (tiktoken)
pip install "abstractcore[server]"       # OpenAI-compatible HTTP gateway

# Combine extras (zsh: keep quotes)
pip install "abstractcore[remote,media,tools]"

# Turnkey local-runtime installs
pip install "abstractcore[all-apple]"    # Apple Silicon: remote SDKs + HF/GGUF + MLX + features + server
pip install "abstractcore[all-gpu]"      # NVIDIA GPU: remote SDKs + HF/GGUF + vLLM + features + server
```

`apple`/`gpu` are hardware-profile aliases for the local LLM engine stack.
Capability extras such as `voice`, `audio`, `vision`, and `music` install the
lightweight plugin paths used for remote-capable routing. `all-apple`/`all-gpu`
are larger aggregate profiles for a full local-development environment,
including local plugin engines where supported.
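
If install commands are generated by a script (for example in a provisioning tool), composing extras is plain string assembly. The helper below is a hypothetical sketch; it does not validate extra names against the package metadata.

```python
from typing import Iterable


def pip_requirement(extras: Iterable[str], package: str = "abstractcore") -> str:
    """Compose one requirement string from several extras.

    Order is kept and duplicates are dropped; no extras yields the bare package.
    """
    unique = list(dict.fromkeys(e.strip() for e in extras if e.strip()))
    if not unique:
        return package
    return package + "[" + ",".join(unique) + "]"


requirement = pip_requirement(["remote", "media", "tools"])
# Quote the requirement so shells such as zsh do not expand the brackets.
print('pip install "' + requirement + '"')
```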

Local OpenAI-compatible servers (Ollama, LMStudio, vLLM, llama.cpp, LocalAI, etc.) work with the core install; you just point AbstractCore at the server base URL. See [Prerequisites](prerequisites.md) for provider setup.

Optional capability plugins (deterministic multimodal outputs):

```bash
pip install abstractvoice   # enables llm.voice / llm.audio (TTS/STT)
pip install abstractvision  # enables llm.vision (generative vision; typically via an OpenAI-compatible images endpoint)
```

See: [Capabilities](capabilities.md) and [Server](server.md).

## Providers and models

AbstractCore uses a provider ID plus a model name:

```python
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")
# llm = create_llm("anthropic", model="claude-haiku-4-5")
# llm = create_llm("ollama", model="qwen3:4b-instruct-2507-q4_K_M")
# llm = create_llm("lmstudio", model="qwen/qwen3-4b-2507")
# llm = create_llm("openai-compatible", model="default", base_url="http://localhost:1234/v1")
```

Tip: you can omit `model=...`, but it’s usually better to pass an explicit model to avoid surprises when defaults change.

Open-source-first: start with local providers (Ollama, LMStudio, MLX, HuggingFace), then add cloud or gateway providers as needed.

Gateway providers (OpenRouter, Portkey) examples:

```python
from abstractcore import create_llm

llm_openrouter = create_llm("openrouter", model="openai/gpt-4o-mini")
llm_portkey = create_llm("portkey", model="gpt-5-mini", api_key="PORTKEY_API_KEY", config_id="pcfg_...")
```

Note: gateway providers only forward optional generation params (e.g. `temperature`, `top_p`, `max_output_tokens`) when you explicitly set them.
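
The "only when explicitly set" behavior can be pictured as follows. This is an illustrative sketch, not the gateway providers' actual code: `build_payload` is a hypothetical helper, and the key names mirror the parameter names used in this guide rather than any specific wire format.

```python
from typing import Any, Dict, List, Optional


def build_payload(
    model: str,
    messages: List[Dict[str, str]],
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    max_output_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Include optional generation params only when the caller set them."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
    }
    # Unset (None) values are omitted so strict backends never see them.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


messages = [{"role": "user", "content": "Say hello in French"}]
print(sorted(build_payload("openai/gpt-4o-mini", messages)))
print(sorted(build_payload("openai/gpt-4o-mini", messages, temperature=0.2)))
```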

## Your first call

OpenAI example (requires `pip install "abstractcore[openai]"`):

```python
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate("What is the capital of France?")
print(resp.content)
```

## Sessions (multi-turn)

Use a session to keep conversation state (system prompt + message history) across turns:

```python
from abstractcore import BasicSession, create_llm

llm = create_llm("openai", model="gpt-4o-mini")
session = BasicSession(provider=llm, system_prompt="You are a helpful assistant.")

print(session.generate("Hello!").content)
print(session.generate("Now continue.").content)
```

For prompt-cache-aware long chats (reuse stable prefixes like system/tools/files), use `CachedSession`:
- See [Prompt Caching](prompt-caching.md).

## Thinking / reasoning (best-effort)

Many modern models can optionally emit a reasoning/thinking trace (sometimes in a separate channel, sometimes inline). AbstractCore exposes a single unified control:

```python
from abstractcore import create_llm

llm = create_llm("lmstudio", model="qwen3.5-27b@q4_k_m", base_url="http://localhost:1234/v1")

# Disable thinking (tries to suppress any reasoning trace)
resp = llm.generate("Compute 17*23 - 19*11. Reply with the integer only.", thinking="none")
print(resp.content)

# Enable thinking (levels are best-effort; not all backends support budgets)
resp = llm.generate("Solve a hard logic puzzle.", thinking="high")
print(resp.content)
print(resp.metadata.get("reasoning"))  # when the backend exposes it
```

Notes:
- For **Qwen3 / Qwen3.5 on LM Studio**, AbstractCore uses LM Studio’s model template variables (`enable_thinking` / `enableThinking`) and a Qwen template “hard switch” for `thinking="none"` (empty `<think></think>`), rather than injecting “Reasoning effort …” text into the system prompt.
- For **Qwen3 / Qwen3.5 GGUF via HuggingFaceProvider (llama-cpp-python)**, there is no template-kwargs knob exposed by llama-cpp-python today, so `thinking="none"` also uses the Qwen hard-switch marker. If GGUF loading fails due to huge advertised context windows, AbstractCore will retry with smaller `n_ctx` values (best-effort); you can also pass `max_tokens=...` when constructing `HuggingFaceProvider()` to explicitly control llama.cpp `n_ctx`.
- For **Ollama**, enabling thinking may consume a lot of output tokens in the thinking channel; consider using a larger `max_output_tokens` when `thinking` is enabled.

For server usage (OpenAI-compatible HTTP), see [Server](server.md) and [Generation Parameters](generation-parameters.md).

## Streaming

```python
from abstractcore import create_llm

llm = create_llm("ollama", model="qwen3:4b-instruct-2507-q4_K_M")
for chunk in llm.generate("Write a short poem about distributed systems.", stream=True):
    print(chunk.content or "", end="", flush=True)
```
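
When you also need the full text after streaming, the chunks can be joined with the same `chunk.content or ""` guard used above. The sketch below uses stand-in chunk objects so it runs without a provider; `collect_stream` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join streamed chunk contents, treating missing content as empty."""
    return "".join(chunk.content or "" for chunk in chunks)


# Stand-in chunks for illustration; real chunks come from generate(..., stream=True).
fake_chunks = [
    SimpleNamespace(content="Hello"),
    SimpleNamespace(content=None),
    SimpleNamespace(content=", world"),
]
text = collect_stream(fake_chunks)
print(text)
```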

## Tool calling

AbstractCore supports native tool calling (when the provider supports it) and prompted tool syntax (when it doesn’t).

By default, tool execution is pass-through (`execute_tools=False`): you get tool calls in `resp.tool_calls`, and your host/runtime decides how to execute them.

In the AbstractFramework ecosystem, **AbstractRuntime** is the recommended runtime for executing tool calls durably (policy, retries, persistence). See [Architecture](architecture.md) and [Tool Calling](tool-calling.md).

```python
from abstractcore import create_llm, tool

@tool
def get_weather(city: str) -> str:
    return f"{city}: 22°C and sunny"

llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate("What's the weather in Paris? Use the tool.", tools=[get_weather])

print(resp.content)
print(resp.tool_calls)
```

See [Tool Calling](tool-calling.md) and [Tool Syntax Rewriting](tool-syntax-rewriting.md) (`tool_call_tags`, server `agent_format`).

Note:
- If you pass both `tools=[...]` and `response_model=...` to `generate()`, AbstractCore uses a 2-pass hybrid flow (tool-capable call, then structured-output call). Streaming is not supported in this hybrid mode.

### Built-in tools (optional)

If you want a ready-made toolset for agentic scripts, install:

```bash
pip install "abstractcore[tools]"
```

Then import from `abstractcore.tools.common_tools`:

- `skim_websearch` vs `web_search`: compact/filtered links vs full results
- `skim_url` vs `fetch_url`: fast URL triage (small output) vs full fetch + parsing for text-first types (HTML/JSON/text)

See [Tool Calling](tool-calling.md) for a recommended workflow and the full built-in tool list.

## Structured output

Pass a Pydantic model via `response_model=...` to get a typed result back (instead of parsing JSON yourself):

```python
from pydantic import BaseModel
from abstractcore import create_llm

class Answer(BaseModel):
    title: str
    bullets: list[str]

llm = create_llm("openai", model="gpt-4o-mini")
answer = llm.generate("Summarize HTTP/3 in 3 bullets.", response_model=Answer)
print(answer.bullets)
```

See [Structured Output](structured-output.md) for strategy details and limitations.

## Media input (images/audio/video + documents)

Images and document extraction require `pip install "abstractcore[media]"` (Pillow + PDF/Office deps).

```python
from abstractcore import create_llm

llm = create_llm("anthropic", model="claude-haiku-4-5")
resp = llm.generate("Describe the image.", media=["./image.png"])
print(resp.content)
```

Audio and video attachments are also supported, but they are **policy-driven** (no silent semantic changes):
- audio: `audio_policy` (`native_only|speech_to_text|auto|caption`)
- video: `video_policy` (`native_only|frames_caption|auto`)

Speech-to-text fallback for normal framework defaults requires an `input.voice`
route. Direct per-call `audio_policy="speech_to_text"` can still force STT with
explicit parameters.
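
Because the policies are plain strings, a typo is easy to miss. The sketch below validates a value against the options listed above before it is passed along; `check_policy` and `ALLOWED_POLICIES` are hypothetical names, and AbstractCore may perform its own validation.

```python
# Option lists copied from the policy descriptions above.
ALLOWED_POLICIES = {
    "audio_policy": ("native_only", "speech_to_text", "auto", "caption"),
    "video_policy": ("native_only", "frames_caption", "auto"),
}


def check_policy(name: str, value: str) -> str:
    """Return value if it is a listed option for the policy, else raise ValueError."""
    allowed = ALLOWED_POLICIES.get(name)
    if allowed is None:
        raise ValueError(f"unknown policy name: {name!r}")
    if value not in allowed:
        raise ValueError(f"{name}={value!r} is not one of: {', '.join(allowed)}")
    return value


print(check_policy("audio_policy", "speech_to_text"))
print(check_policy("video_policy", "frames_caption"))
```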

What you need (quick checklist):
- **Images**: `abstractcore[media]` + either a vision-capable model (VLM/VL) **or** configured vision fallback (`abstractcore --set-vision-provider PROVIDER MODEL`).
- **Video**: `ffmpeg`/`ffprobe` on `PATH` + either native/visual support in the selected text route or an explicit `input.video` route. Native video input is model/provider dependent.
- **Audio**: either an audio-capable model **or** an explicit `input.voice` speech-to-text route.

Defaults can be configured via the config CLI (`abstractcore --config`, `abstractcore --status`). See [Centralized Config](centralized-config.md).

If your main model is text-only, you can configure vision fallback (two-stage captioning) so images are automatically described and injected as short observations. See [Media Handling](media-handling-system.md), [Vision Capabilities](vision-capabilities.md), and [Centralized Config](centralized-config.md).

For long documents, AbstractCore can optionally apply Glyph visual-text compression. Install `pip install "abstractcore[compression]"` (and `pip install "abstractcore[media]"` for PDFs) and see [Glyph Visual-Text Compression](glyphs.md).

## Async

```python
import asyncio
from abstractcore import create_llm

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")
    resp = await llm.agenerate("Give me 3 bullet points about HTTP caching.")
    print(resp.content)

asyncio.run(main())
```
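
A common next step is running several prompts concurrently. The sketch below shows one way to do that with `asyncio.gather` and a concurrency cap; it takes the async callable as a parameter so it runs here with a stand-in. `generate_many` and `fake_agenerate` are hypothetical names; in practice you would pass `llm.agenerate`.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def generate_many(
    agenerate: Callable[[str], Awaitable[Any]],
    prompts: List[str],
    limit: int = 4,
) -> List[Any]:
    """Run agenerate over prompts concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def one(prompt: str) -> Any:
        async with semaphore:
            return await agenerate(prompt)

    # gather returns results in the same order as the input prompts.
    return list(await asyncio.gather(*(one(p) for p in prompts)))


async def fake_agenerate(prompt: str) -> str:
    # Stand-in for `llm.agenerate`; swap in the real method in practice.
    await asyncio.sleep(0)
    return prompt.upper()


results = asyncio.run(generate_many(fake_agenerate, ["a", "b", "c"]))
print(results)
```

The cap matters mostly for local servers and rate-limited APIs, where unbounded fan-out tends to cause timeouts.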

## CLI (optional)

```bash
# Configure defaults and API keys
abstractcore --config
abstractcore --status

# Interactive chat
abstractcore-chat --provider openai --model gpt-4o-mini
```

## Next steps

- [Prerequisites](prerequisites.md) — provider setup (keys, base URLs, hardware notes)
- [FAQ](faq.md) — common questions and setup gotchas
- [Examples](examples.md) — end-to-end patterns and recipes
- [API (Python)](api.md) — public API map and common patterns
- [API Reference](api-reference.md) — complete function/class listing
- [Troubleshooting](troubleshooting.md) — common errors and fixes
- [Server](server.md) — OpenAI-compatible HTTP gateway
- [Endpoint](endpoint.md) — single-model OpenAI-compatible endpoint (one provider/model per worker)

---

### Inlined: `docs/prerequisites.md`

# Prerequisites & Setup Guide

This guide walks you through setting up AbstractCore with different LLM providers. Choose the provider(s) that fit your needs; you can use multiple providers in the same application.

## Quick Decision Guide

**Want to get started immediately?** → [OpenAI Setup](#openai-setup) (requires API key)

**Want free local models?** → [Ollama Setup](#ollama-setup) (free, runs on your machine)

**Have an Apple Silicon Mac?** → [MLX Setup](#mlx-setup) (optimized for M1/M2/M3/M4 chips)

**Have an NVIDIA GPU?** → [vLLM Setup](#vllm-setup) (production GPU inference; NVIDIA CUDA only)

**Want a GUI for local models?** → [LMStudio Setup](#lmstudio-setup) (easiest local setup)

**Want a gateway/proxy?** → [Gateway Provider Setup](#gateway-provider-setup-openrouter-portkey) (OpenRouter/Portkey routing + governance)

**Using a custom OpenAI-compatible `/v1` endpoint?** → [OpenAI-Compatible Setup](#openai-compatible-setup)

## Core Installation

Install AbstractCore, then add the extras you need. Extras compose, so a real
application can use one command such as `pip install "abstractcore[remote,server,tools]"`.

```bash
# Core: local HTTP servers and gateways that need no SDK
# Includes Ollama, LM Studio, OpenRouter, Portkey, and OpenAI-compatible /v1 endpoints
pip install abstractcore

# Hosted API SDKs (OpenAI + Anthropic). OpenRouter/Portkey still work from core.
pip install "abstractcore[remote]"

# Individual provider SDKs / local runtimes
pip install "abstractcore[openai]"       # OpenAI SDK
pip install "abstractcore[anthropic]"    # Anthropic SDK
pip install "abstractcore[huggingface]"  # Transformers / torch (heavy)
pip install "abstractcore[apple]"        # Apple Silicon local LLM stack (alias of mlx; heavy)
pip install "abstractcore[gpu]"          # GPU local LLM stack (alias of vllm; heavy)
pip install "abstractcore[mlx]"          # Explicit MLX provider extra
pip install "abstractcore[vllm]"         # Explicit vLLM provider extra

# Optional features
pip install "abstractcore[tools]"       # built-in web tools (web_search, skim_websearch, skim_url, fetch_url)
pip install "abstractcore[media]"       # images, PDFs, Office docs
pip install "abstractcore[embeddings]"  # EmbeddingManager + local embedding models
pip install "abstractcore[tokens]"      # precise token counting (tiktoken)
pip install "abstractcore[server]"      # OpenAI-compatible HTTP gateway
pip install "abstractcore[compression]" # Glyph visual-text compression (Pillow renderer)

# Turnkey local-runtime installs
pip install "abstractcore[all-apple]"    # Apple Silicon: remote SDKs + HF/GGUF + MLX + features + server
pip install "abstractcore[all-gpu]"      # NVIDIA GPU: remote SDKs + HF/GGUF + vLLM + features + server
```

**Hardware Notes:**
- `[apple]` - Native Apple local LLM stack; currently aliases `[mlx]` and only works on Apple Silicon (M1/M2/M3/M4)
- `[gpu]` - Local GPU LLM stack; currently aliases `[vllm]` and only works with supported NVIDIA CUDA GPUs (see [vLLM Setup](#vllm-setup))
- `[mlx]` - Provider-specific MLX extra, kept for explicit installs
- `[vllm]` - Provider-specific vLLM extra, kept for explicit installs
- `[remote]` - Lightweight hosted SDK bundle for OpenAI + Anthropic; OpenRouter, Portkey, Ollama, LM Studio, and generic `/v1` endpoints need no extra dependency.
- `[all-apple]` - Best for Apple Silicon local development (includes MLX and local plugin engines where supported, excludes vLLM)
- `[all-gpu]` - Best for NVIDIA GPU local development (includes vLLM and local plugin engines where supported, excludes MLX)
- For CPU-only or Intel machines, compose only what you need, for example `abstractcore[remote,huggingface,tools]`.

Capability extras such as `[voice]`, `[audio]`, `[vision]`, and `[music]`
install lightweight plugin routing surfaces for remote-capable backends. Local
inference engines remain behind explicit local profiles such as `[all-apple]`,
`[all-gpu]`, or plugin-specific local extras.

## Cloud Provider Setup

### OpenAI Setup

**Best for**: Production applications and OpenAI’s hosted models

#### 1. Get API Key

1. Go to [OpenAI API Dashboard](https://platform.openai.com/api-keys)
2. Create account or sign in
3. Click "Create new secret key"
4. Copy the key (starts with `sk-`)

#### 2. Set Environment Variable

```bash
# Option 1: Export in terminal (temporary)
export OPENAI_API_KEY="sk-your-actual-api-key-here"

# Option 2: Add to ~/.bashrc or ~/.zshrc (permanent)
echo 'export OPENAI_API_KEY="sk-your-actual-api-key-here"' >> ~/.bashrc
source ~/.bashrc

# Option 3: Add to a .env file in your project (>> appends, so existing keys are kept)
echo 'OPENAI_API_KEY=sk-your-actual-api-key-here' >> .env
```

#### 3. Test Setup

```python
from abstractcore import create_llm

# Test with an example model (use any model available on your account)
llm = create_llm("openai", model="gpt-4o-mini")
response = llm.generate("Say hello in French")
print(response.content)  # Expect a French greeting such as "Bonjour!"
```

**Model names**: Use any model supported by your account (examples: `gpt-4o-mini`, `gpt-4o`).

### Anthropic Setup

**Best for**: Claude models via Anthropic’s API

#### 1. Get API Key

1. Go to [Anthropic Console](https://console.anthropic.com/)
2. Create account or sign in
3. Go to "API Keys" section
4. Click "Create Key"
5. Copy the key (starts with `sk-ant-`)

#### 2. Set Environment Variable

```bash
# Option 1: Export in terminal (temporary)
export ANTHROPIC_API_KEY="sk-ant-your-actual-api-key-here"

# Option 2: Add to shell profile (permanent)
echo 'export ANTHROPIC_API_KEY="sk-ant-your-actual-api-key-here"' >> ~/.bashrc
source ~/.bashrc

# Option 3: Add to a .env file (>> appends, so existing keys are kept)
echo 'ANTHROPIC_API_KEY=sk-ant-your-actual-api-key-here' >> .env
```

#### 3. Test Setup

```python
from abstractcore import create_llm

# Test with an example model (use any model available on your account)
llm = create_llm("anthropic", model="claude-haiku-4-5")
response = llm.generate("Explain Python in one sentence")
print(response.content)
```

**Model names**: Use any model supported by your account (examples: `claude-haiku-4-5`, `claude-sonnet-4-5`).

### Gateway Provider Setup (OpenRouter, Portkey)

**Best for**: routing, observability/governance, and unified billing across multiple backends.

Gateways expose an OpenAI-compatible `/v1` endpoint and forward your payload to the routed backend model. Because some backends are strict (for example OpenAI reasoning families like gpt-5/o1 reject unsupported parameters), AbstractCore’s gateway providers forward optional generation parameters (like `temperature`, `top_p`, `max_output_tokens`) **only when explicitly set**.

No provider SDK extra is required for OpenRouter or Portkey; the core install uses AbstractCore's internal HTTP/OpenAI-compatible client.

#### OpenRouter Setup

1. Create an API key: https://openrouter.ai/keys
2. Set the environment variable:

```bash
export OPENROUTER_API_KEY="sk-or-..."
# Optional override (default: https://openrouter.ai/api/v1)
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
```

3. Test:

```python
from abstractcore import create_llm

llm = create_llm("openrouter", model="openai/gpt-4o-mini")
resp = llm.generate("Say hello in French")
print(resp.content)
```

#### Portkey Setup

Portkey routes requests using a **config id** (commonly `pcfg_...`).

1. Create an API key and config in Portkey, then copy:
   - `PORTKEY_API_KEY`
   - `PORTKEY_CONFIG` (config id)

2. Set environment variables:

```bash
export PORTKEY_API_KEY="pk_..."
export PORTKEY_CONFIG="pcfg_..."
# Optional override (default: https://api.portkey.ai/v1)
export PORTKEY_BASE_URL="https://api.portkey.ai/v1"
```

3. Test:

```python
from abstractcore import create_llm

llm = create_llm("portkey", model="gpt-4o-mini", config_id="pcfg_...")
resp = llm.generate("Say hello in French")
print(resp.content)
```

## Local Provider Setup

### Ollama Setup

**Best for**: Privacy, no API keys, offline usage, customization

**Requirements**: 8GB+ RAM, works on Mac/Linux/Windows

#### 1. Install Ollama

**macOS:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
# OR download from https://ollama.com/download
```

**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
1. Download installer from [ollama.com/download](https://ollama.com/download)
2. Run the installer
3. Restart terminal

#### 2. Start Ollama Service

```bash
# Start Ollama server (runs in background)
ollama serve
```

#### 3. Download Models

```bash
# Pull any model you want to use, then verify it's installed.
ollama pull qwen3:4b-instruct-2507-q4_K_M
ollama list
```

#### 4. Test Setup

```python
from abstractcore import create_llm

# Test with any model you installed via `ollama pull ...`
llm = create_llm("ollama", model="qwen3:4b-instruct-2507-q4_K_M")
response = llm.generate("What is Python?")
print(response.content)
```

### MLX Setup

**Best for**: M1/M2/M3/M4 Macs, optimized inference, good speed

**Requirements**: Apple Silicon Mac (M1/M2/M3/M4)

#### 1. Install MLX Dependencies

```bash
# The mlx extra installs the MLX dependencies alongside AbstractCore
pip install "abstractcore[mlx]"
```

#### 2. Download Models

AbstractCore is offline-first for local model weights. The MLX provider loads
models from an explicit local path, the Hugging Face cache, or the LM Studio
cache; it does not silently download weights during `create_llm(...)`.

Download or prefetch a model first, for example:

```bash
huggingface-cli download mlx-community/Qwen3-4B
```

Then use the model ID, or pass a local model directory path:

```python
from abstractcore import create_llm

# Uses a cached Hugging Face snapshot or a local directory path.
llm = create_llm("mlx", model="mlx-community/Qwen3-4B")
# OR
llm = create_llm("mlx", model="/path/to/local/mlx-model")
```
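
Since weights are not downloaded during `create_llm(...)`, it can help to check the cache first. The sketch below assumes the standard `huggingface_hub` cache layout (`<HF_HOME>/hub/models--<org>--<name>/snapshots/`); it does not cover the LM Studio cache or an `HF_HUB_CACHE` override, and it is not necessarily how the MLX provider resolves models. Both helper names are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def hf_repo_cache_dir(model_id: str, hf_home: Optional[str] = None) -> Path:
    """Return the Hugging Face hub cache folder for a model repo ID."""
    home = hf_home or os.environ.get("HF_HOME") or "~/.cache/huggingface"
    folder = "models--" + model_id.replace("/", "--")
    return Path(home).expanduser() / "hub" / folder


def is_cached(model_id: str, hf_home: Optional[str] = None) -> bool:
    """True when at least one snapshot directory exists for the model."""
    snapshots = hf_repo_cache_dir(model_id, hf_home) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


print(hf_repo_cache_dir("mlx-community/Qwen3-4B"))
print(is_cached("mlx-community/Qwen3-4B"))
```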

#### 3. Test Setup

```python
from abstractcore import create_llm

# Test with a model you already downloaded/prefetched
llm = create_llm("mlx", model="mlx-community/Qwen3-4B")
response = llm.generate("Explain machine learning briefly")
print(response.content)
```

**Popular MLX Models**:
- `mlx-community/Llama-3.2-3B-Instruct-4bit` - 1.8GB, fast
- `mlx-community/Qwen2.5-Coder-7B-Instruct-4bit` - 4.2GB, suitable for code
- `mlx-community/Llama-3.1-8B-Instruct-4bit` - 4.7GB, high quality

### LMStudio Setup

**Best for**: Easy GUI management, Windows users, non-technical users

**Requirements**: 8GB+ RAM, works on Mac/Linux/Windows

#### 1. Install LMStudio

1. Download from [lmstudio.ai](https://lmstudio.ai/)
2. Install the application
3. Launch LMStudio

#### 2. Download Models

1. Open LMStudio
2. Go to "Discover" tab
3. Search for recommended models:
   - `microsoft/Phi-3-mini-4k-instruct-gguf` (small, fast)
   - `microsoft/Phi-3-medium-4k-instruct-gguf` (medium quality)
   - `meta-llama/Llama-2-7b-chat-gguf` (good general purpose)
4. Click download for your preferred model

#### 3. Start Local Server

1. Go to "Local Server" tab in LMStudio
2. Select your downloaded model
3. Click "Start Server"
4. Note the port (usually 1234)

#### 4. Test Setup

```python
from abstractcore import create_llm

# LM Studio exposes an OpenAI-compatible server (default: http://localhost:1234/v1).
# Use the model ID shown in LM Studio (or try "local-model" if unsure).
llm = create_llm("lmstudio", model="local-model", base_url="http://localhost:1234/v1")
resp = llm.generate("Hello, how are you?")
print(resp.content)
```

### HuggingFace Setup

**Best for**: Latest research models, custom models, GGUF files

**Requirements**: 8GB+ RAM, Python environment

#### 1. Install Dependencies

```bash
pip install "abstractcore[huggingface]"
```

#### 2. Optional: Get HuggingFace Token

For private models or higher rate limits:

1. Go to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Create a "Read" token
3. Set environment variable:

```bash
export HUGGINGFACE_TOKEN="hf_your-token-here"
```

#### 3. Test Setup

```python
from abstractcore import create_llm

# Use a small model for testing (auto-downloads)
llm = create_llm("huggingface", model="microsoft/DialoGPT-medium")
response = llm.generate("Hello there!")
print(response.content)
```

**Popular HuggingFace Models**:
- `microsoft/DialoGPT-medium` - Good for conversation
- `facebook/blenderbot-400M-distill` - Conversational AI
- `microsoft/CodeBERT-base` - Code understanding

Quantized Transformers checkpoints may need optional quantization runtimes
beyond `abstractcore[huggingface]`. If loading reports missing weights,
unexpected packed weights, or incorrect trivial output, treat it as a
model/runtime compatibility issue. See `docs/huggingface-model-compatibility.md`.

### vLLM Setup

**Best for**: Production GPU deployments, high-throughput inference, tensor parallelism

**Requirements**:
- **NVIDIA GPU with CUDA support** (A100, H100, RTX 4090, etc.)
- Linux operating system
- CUDA 12.1+ installed
- 16GB+ VRAM recommended
- **NOT compatible with**: Apple Silicon, AMD GPUs, CPU-only systems

**NVIDIA CUDA only.** If you’re on Apple Silicon, use MLX. If you’re on CPU-only, use Ollama/HuggingFace.

#### ⚠️ Hardware Compatibility Warning

**vLLM ONLY works with NVIDIA CUDA GPUs.** It will NOT work on:
- ❌ Apple Silicon (M1/M2/M3/M4) - Use MLX provider instead
- ❌ AMD GPUs - Use HuggingFace or Ollama instead
- ❌ Intel integrated graphics
- ❌ CPU-only systems

#### 1. Install vLLM

```bash
# Install AbstractCore with vLLM support
pip install "abstractcore[vllm]"

# This installs vLLM which requires NVIDIA CUDA
# If you get CUDA errors, ensure CUDA 12.1+ is installed:
# https://developer.nvidia.com/cuda-downloads
```

#### 2. Start vLLM Server

**IMPORTANT**: Check your GPU setup first to avoid out-of-memory (OOM) errors:

```bash
# Check available GPUs
nvidia-smi

# Shows: GPU name, VRAM capacity, and current usage
# Example: 4x NVIDIA L4 (23GB each) = 92GB total
```

**Choose the right startup command based on your hardware:**

```bash
# Single GPU (24GB+) - Works for 7B-14B models
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8000

# Single GPU (24GB+) - For 30B models, reduce memory
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096

# Multiple GPUs (RECOMMENDED for 30B models) - Use tensor parallelism
# Example: 4x NVIDIA L4 (23GB each)
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --max-num-seqs 128

# Multiple GPUs + LoRA support (Production setup)
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 \
    --enable-lora --max-loras 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --max-num-seqs 128
```

**Key Parameters:**
- `--tensor-parallel-size N` - Split model across N GPUs (REQUIRED for 30B+ models on <40GB GPUs)
- `--gpu-memory-utilization 0.9` - Use 90% of GPU memory (leave 10% for CUDA overhead)
- `--max-model-len` - Maximum context length (reduce if OOM)
- `--max-num-seqs` - Maximum concurrent sequences (128 recommended for 30B models, default 256 may cause OOM)
- `--enable-lora` - Enable dynamic LoRA adapter loading
- `--max-loras` - Maximum number of LoRA adapters to keep in memory
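
If you launch vLLM from a script, the parameters above can be assembled programmatically. The sketch below only emits flags that appear in this section; `vllm_serve_argv` is a hypothetical helper, and the values in the example are the ones from the multi-GPU command above, not tuned recommendations.

```python
import shlex
from typing import List, Optional


def vllm_serve_argv(
    model: str,
    port: int = 8000,
    tensor_parallel_size: Optional[int] = None,
    gpu_memory_utilization: Optional[float] = None,
    max_model_len: Optional[int] = None,
    max_num_seqs: Optional[int] = None,
    max_loras: Optional[int] = None,
) -> List[str]:
    """Build a `vllm serve` argv, adding only the options that were set."""
    argv = ["vllm", "serve", model, "--port", str(port)]
    options = [
        ("--tensor-parallel-size", tensor_parallel_size),
        ("--gpu-memory-utilization", gpu_memory_utilization),
        ("--max-model-len", max_model_len),
        ("--max-num-seqs", max_num_seqs),
    ]
    for flag, value in options:
        if value is not None:
            argv += [flag, str(value)]
    if max_loras is not None:
        # LoRA support needs the enable flag plus the adapter limit.
        argv += ["--enable-lora", "--max-loras", str(max_loras)]
    return argv


argv = vllm_serve_argv(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    max_num_seqs=128,
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` on a machine where `vllm` is installed.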

**Troubleshooting OOM Errors:**

If you see `CUDA out of memory` errors:

1. **Reduce concurrent sequences**: `--max-num-seqs 128` (or 64, 32 for tighter memory)
2. **Enable tensor parallelism**: `--tensor-parallel-size 2` (or 4, 8 depending on GPU count)
3. **Reduce memory usage**: `--gpu-memory-utilization 0.85 --max-model-len 4096`
4. **Use smaller model**: `Qwen/Qwen2.5-Coder-7B-Instruct` instead of 30B
5. **Use quantized model**: `Qwen/Qwen2.5-Coder-32B-Instruct-AWQ` (4-bit quantization)

**Verify the server is running:**

```bash
# Check server health
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Test generation
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 50
  }'
```
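
The same generation test can be issued from Python with only the standard library. The sketch below builds the request shown in the `curl` example and leaves the network call commented out so it can be inspected without a running server; `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 50
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /chat/completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1", "Qwen/Qwen3-Coder-30B-A3B-Instruct", "Say hello"
)
print(request.get_method(), request.full_url)

# To send it (requires a running server):
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```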

#### 3. Test Setup

```python
from abstractcore import create_llm

# Basic generation
llm = create_llm("vllm", model="Qwen/Qwen3-Coder-30B-A3B-Instruct")
response = llm.generate("Write a Python function to sort a list")
print(response.content)

# With guided JSON (vLLM-specific feature)
response = llm.generate(
    "List 3 programming languages",
    guided_json={
        "type": "object",
        "properties": {
            "languages": {"type": "array", "items": {"type": "string"}}
        }
    }
)
print(response.content)
```

#### 4. vLLM-Specific Features

**Guided Decoding** (syntax-constrained generation):
```python
# Regex-constrained generation
response = llm.generate(
    "Write a Python function",
    guided_regex=r"def \w+\([^)]*\):\n(?:\s{4}.*\n)+"
)

# JSON schema enforcement
response = llm.generate(
    "Extract person info",
    guided_json={"type": "object", "properties": {...}}
)
```

**Multi-LoRA** (1 base model → many specialized agents):
```python
# Load specialized adapters
llm.load_adapter("sql-expert", "/models/adapters/sql-lora")
llm.load_adapter("react-dev", "/models/adapters/react-lora")

# Route to specialized adapter
response = llm.generate("Write SQL query", model="sql-expert")
```

**Beam Search** (higher accuracy for complex tasks):
```python
response = llm.generate(
    "Solve this complex algorithm problem...",
    use_beam_search=True,
    best_of=5  # Generate 5 candidates, return best
)
```

#### Environment Variables

```bash
# vLLM server URL (default: http://localhost:8000/v1)
export VLLM_BASE_URL="http://192.168.1.100:8000/v1"

# Optional API key (if server started with --api-key)
export VLLM_API_KEY="your-api-key"

# HuggingFace cache (shared with HF/MLX providers)
export HF_HOME="$HOME/.cache/huggingface"  # "~" is not expanded inside double quotes
```

**Available Models**:
- `Qwen/Qwen3-Coder-30B-A3B-Instruct` (default) - Excellent for code
- `meta-llama/Llama-3.1-8B-Instruct` - Good general purpose
- `mistralai/Mistral-7B-Instruct-v0.3` - Fast and efficient
- Any HuggingFace model compatible with vLLM

**Performance notes**: Throughput depends on model size, context length, concurrency, quantization, and GPU. See vLLM docs for tuning knobs (`--tensor-parallel-size`, `--max-model-len`, `--max-num-seqs`, …).

### OpenAI-Compatible Setup

**Best for**: any OpenAI-compatible `/v1` endpoint (llama.cpp servers, LocalAI, text-generation-webui, custom proxies, etc.)

AbstractCore supports a generic OpenAI-compatible provider plus specific convenience providers (LM Studio, vLLM, OpenRouter, Portkey).

#### 1. Get the endpoint base URL

You must include `/v1` for OpenAI-compatible servers:

```bash
export OPENAI_BASE_URL="http://localhost:1234/v1"
# Optional (if your endpoint requires auth)
export OPENAI_API_KEY="your-api-key"
```
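
Because a missing `/v1` is an easy mistake, a small normalizer can be applied before the URL is handed to a provider. The sketch below is illustrative only; `ensure_v1_suffix` is a hypothetical helper, not an AbstractCore function.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_v1_suffix(base_url: str) -> str:
    """Return base_url with a single trailing '/v1' path segment and no trailing slash."""
    parts = urlsplit(base_url.strip())
    path = parts.path.rstrip("/")
    if not path.endswith("/v1"):
        path = f"{path}/v1"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(ensure_v1_suffix("http://localhost:1234"))      # http://localhost:1234/v1
print(ensure_v1_suffix("http://localhost:1234/v1/"))  # http://localhost:1234/v1
```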

#### 2. Test Setup

```python
from abstractcore import create_llm

llm = create_llm("openai-compatible", model="default", base_url="http://localhost:1234/v1")
resp = llm.generate("Say hello in French")
print(resp.content)
```

## Troubleshooting

### Common Issues

#### "No module named 'abstractcore'"
```bash
# Make sure you installed AbstractCore
pip install abstractcore
```

#### "OpenAI API key not found"
```bash
# Check if environment variable is set
echo $OPENAI_API_KEY

# If empty, set it:
export OPENAI_API_KEY="sk-your-key-here"
```

#### "Connection error to Ollama"
```bash
# Make sure Ollama is running
ollama serve

# Check if models are available
ollama list

# Pull a model if none available
ollama pull gemma3:1b
```

#### "Model not found in MLX"
```python
from abstractcore import create_llm

# Use exact model names from the HuggingFace mlx-community organization
llm = create_llm("mlx", model="mlx-community/Llama-3.2-3B-Instruct-4bit")
```

#### "LMStudio connection refused"
```bash
# Make sure LMStudio server is running on correct port
# Check LMStudio logs for the exact port and URL
```

### Memory Issues

#### "Out of memory" with local models
```bash
# Try smaller models
ollama pull gemma3:1b        # Only 1.3GB
ollama pull tinyllama        # Only 637MB

# Or increase swap space on Linux
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```

#### MLX models too slow
```python
from abstractcore import create_llm

# Use 4-bit quantized models for faster inference
llm = create_llm("mlx", model="mlx-community/Llama-3.2-3B-Instruct-4bit")
```

### API Key Issues

#### OpenAI billing issues
1. Check your [billing dashboard](https://platform.openai.com/account/billing)
2. Add payment method if needed
3. Check usage limits

#### Anthropic rate limits
1. Check your [console](https://console.anthropic.com/)
2. Upgrade to higher tier if needed
3. Implement retry logic in your code
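
For step 3, one common approach is exponential backoff with a little jitter. The helper below is a generic sketch under stated assumptions: `retry_with_backoff` is a hypothetical name, and which exception type signals a rate limit depends on the provider, so it is left as a parameter.

```python
import random
import time


def retry_with_backoff(call, *, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call `call()` until it succeeds or `max_attempts` is reached.

    Waits base_delay * 2**attempt (capped at max_delay) plus up to 10% jitter
    between attempts, and re-raises the last error when attempts run out.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
```

A call site could look like `retry_with_backoff(lambda: llm.generate("Hello"), retry_on=(SomeRateLimitError,))`, where `SomeRateLimitError` stands for whatever exception your provider raises.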

## Testing Your Setup

### Universal Test Script

Save this as `test_setup.py` and run it to test all your providers:

```python
#!/usr/bin/env python3
"""Test script for AbstractCore providers"""

import os
from abstractcore import create_llm

def test_provider(provider_name, model, **kwargs):
    """Test a specific provider"""
    try:
        print(f"\n🧪 Testing {provider_name} with {model}...")
        llm = create_llm(provider_name, model=model, **kwargs)
        response = llm.generate("Say 'Hello from AbstractCore!'")
        print(f"[OK] {provider_name}: {response.content}")
        return True
    except Exception as e:
        print(f"[FAIL] {provider_name}: {e}")
        return False

def main():
    print("AbstractCore Provider Test Suite")
    print("=" * 40)

    results = {}

    # Test cloud providers (if API keys available)
    if os.getenv("OPENAI_API_KEY"):
        results["OpenAI"] = test_provider("openai", "gpt-4o-mini")
    else:
        print("\n⚠️  Skipping OpenAI (no OPENAI_API_KEY)")

    if os.getenv("ANTHROPIC_API_KEY"):
        results["Anthropic"] = test_provider("anthropic", "claude-haiku-4-5")
    else:
        print("\n⚠️  Skipping Anthropic (no ANTHROPIC_API_KEY)")

    if os.getenv("OPENROUTER_API_KEY"):
        results["OpenRouter"] = test_provider("openrouter", "openai/gpt-4o-mini")
    else:
        print("\n⚠️  Skipping OpenRouter (no OPENROUTER_API_KEY)")

    # Test local providers
    results["Ollama"] = test_provider("ollama", "gemma3:1b")

    # test_provider() catches its own errors and reports [FAIL], so no extra
    # try/except is needed. MLX requires Apple Silicon and a downloaded model.
    results["MLX"] = test_provider("mlx", "mlx-community/Llama-3.2-3B-Instruct-4bit")

    # Note: OpenAI-compatible servers expect `/v1` in the base URL (LM Studio default is http://localhost:1234/v1)
    results["LMStudio"] = test_provider("lmstudio", "qwen/qwen3-4b-2507", base_url="http://localhost:1234/v1")

    # Summary
    print("\n" + "=" * 40)
    print("Test Results:")
    working = [name for name, success in results.items() if success]
    if working:
        print(f"[OK] Working providers: {', '.join(working)}")
    else:
        print("[FAIL] No providers working")

    print("\n[INFO] Next steps:")
    print("- Add API keys for cloud providers")
    print("- Install Ollama and download models")
    print("- Start LMStudio local server")
    print("- See docs/prerequisites.md for detailed setup")

if __name__ == "__main__":
    main()
```

Run the test:
```bash
python test_setup.py
```

### Live API smoke tests (opt-in)

Some tests are intentionally **real network calls** and are disabled by default. To enable them, set:
- `ABSTRACTCORE_RUN_LIVE_API_TESTS=1`

Example (OpenRouter):
```bash
ABSTRACTCORE_RUN_LIVE_API_TESTS=1 OPENROUTER_API_KEY="$OPENROUTER_API_KEY" \
  python -m pytest -q tests/test_graceful_fallback.py::test_openrouter_generation_smoke
```

Local provider smoke tests use `ABSTRACTCORE_RUN_LOCAL_PROVIDER_TESTS=1` (and `ABSTRACTCORE_RUN_MLX_TESTS=1` for MLX).

## Security Notes

### API Keys
- Never commit API keys to version control
- Use environment variables or `.env` files
- Rotate keys periodically
- Monitor usage for unexpected spikes

### Local Models
- Local models keep data on your machine
- No internet required after initial download
- Models can be large (1GB-20GB+)
- Some models may have usage restrictions

### Network Security
- LMStudio and Ollama servers run locally by default
- Be careful when exposing servers to the network (require authentication)
- Consider firewall rules for production deployments

This setup guide should get you running with any AbstractCore provider. Choose what works well for your use case - you can always add more providers later!

---

### Inlined: `docs/api.md`

# API (Python)

This page is a user-facing map of the **public Python API** exposed from `abstractcore` (see `abstractcore/__init__.py`). For a complete listing of functions/classes (including events), see **[API Reference](api-reference.md)**.

New to AbstractCore? Start with **[Getting Started](getting-started.md)**.

Implementation pointers (source of truth):
- `create_llm`: `abstractcore/core/factory.py` → `abstractcore/providers/registry.py`
- `BasicSession`: `abstractcore/core/session.py`
- `CachedSession`: `abstractcore/core/cached_session.py` (prompt caching; see `docs/prompt-caching.md`)
- Response/types: `abstractcore/core/types.py`
- Tool decorator: `abstractcore/tools/core.py`

## Core entrypoints

### `create_llm(...)`

Create a provider instance:

```python
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")  # requires: pip install "abstractcore[openai]"
resp = llm.generate("Hello!")
print(resp.content)
```

Provider IDs (common): `openai`, `anthropic`, `openrouter`, `portkey`, `ollama`, `lmstudio`, `vllm`, `openai-compatible`, `huggingface`, `mlx`.

### Gateway providers (OpenRouter, Portkey)

```python
from abstractcore import create_llm

llm_openrouter = create_llm("openrouter", model="openai/gpt-4o-mini")
llm_portkey = create_llm("portkey", model="gpt-5-mini", api_key="PORTKEY_API_KEY", config_id="pcfg_...")
```

Gateway notes:
- OpenRouter uses `OPENROUTER_API_KEY` (model names like `openai/...`).
- Portkey uses `PORTKEY_API_KEY` plus a config id (`PORTKEY_CONFIG`).
- Optional generation parameters (`temperature`, `top_p`, `max_output_tokens`, etc.) are only forwarded when explicitly set.
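
The last note can be pictured as a payload builder that leaves out anything the caller did not set. This is an illustrative sketch, not AbstractCore's internal code: `build_payload` is a hypothetical helper and the payload keys are assumptions.

```python
from typing import Any, Dict, Optional


def build_payload(model: str, prompt: str, *, temperature: Optional[float] = None,
                  top_p: Optional[float] = None,
                  max_output_tokens: Optional[int] = None) -> Dict[str, Any]:
    """Build a request payload, including optional parameters only when they were set."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "max_output_tokens": max_output_tokens,
    }
    # None means "not set": leave the key out so the gateway applies its own default.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Testing against `None` rather than truthiness matters here: an explicit `temperature=0.0` is still forwarded.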

### `BasicSession`

Keep conversation state:

```python
from abstractcore import BasicSession, create_llm

session = BasicSession(create_llm("anthropic", model="claude-haiku-4-5"))  # requires: abstractcore[anthropic]
print(session.generate("Give me 3 name ideas.").content)
print(session.generate("Pick the best one.").content)
```

### `CachedSession` (prompt caching)

For prompt-cache-aware long chats (reuse stable prefixes like system/tools/files), use `CachedSession`:

```python
from abstractcore import CachedSession, create_llm

llm = create_llm("mlx", model="mlx-community/Qwen3-4B")  # requires: abstractcore[mlx]
session = CachedSession(provider=llm, system_prompt="You are helpful.", prompt_cache_strategy="auto")
session.attach_files(["/path/to/large_context.md"])
print(session.generate("Summarize the attached file.").content)
```

See **[Prompt Caching](prompt-caching.md)**.

### Durable memory bloc artifacts

For exact local-memory reuse, persist text/file content as a bloc, compile one provider/model
artifact, then load it into a runtime prompt-cache key:

```python
from abstractcore import create_llm, ensure_bloc_kv_artifact, load_bloc_kv_artifact
from abstractcore.core.file_blocs import FileBlocStore

llm = create_llm("mlx", model="mlx-community/Qwen3-4B")
store = FileBlocStore()
record = store.upsert(file_meta={...}, content="stable memory text")

ensure = ensure_bloc_kv_artifact(provider=llm, store=store, record=record, debug=True)
loaded = load_bloc_kv_artifact(provider=llm, store=store, record=record, key="work:memory")
resp = llm.generate("Use the loaded memory.", prompt_cache_binding=loaded.prompt_cache_binding)
```

The public helpers are exported from both `abstractcore` and `abstractcore.core`. The shared
contract currently covers MLX, HuggingFace transformers, and supported HuggingFace GGUF
exact-renderer paths. Artifact payloads are provider/model-native; the portable contract is the
manifest, binding object, Python helpers, and server route shape.

### `tool` (decorator)

Define tools in Python with a decorator, then pass them to `generate()` / `agenerate()`:

```python
from abstractcore import create_llm, tool

@tool
def get_weather(city: str) -> str:
    return f"{city}: 22°C and sunny"

llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate("Use the tool.", tools=[get_weather])
print(resp.tool_calls)
```

## Responses (`GenerateResponse`)

Most calls return a `GenerateResponse` object (or an iterator of them for streaming). Common fields:

- `content`: cleaned assistant text
- `tool_calls`: structured tool calls (pass-through by default)
- `usage`: token usage (provider-dependent)
- `metadata`: provider/model specific fields (for example extracted reasoning text when configured)

## Model downloads (`download_model`, optional)

`download_model(...)` is an **async generator** that yields `DownloadProgress` updates while a model is being fetched.

Supported providers:
- `ollama`: pulls via the Ollama HTTP API (`/api/pull`)
- `huggingface` / `mlx`: downloads from HuggingFace Hub (requires `pip install "abstractcore[huggingface]"`; pass `token=` for gated models)

Example:

```python
import asyncio
from abstractcore import download_model

async def main():
    async for p in download_model("ollama", "qwen3:4b-instruct-2507-q4_K_M"):
        print(p.status.value, p.message)

asyncio.run(main())
```

Implementation: `abstractcore/download.py`. For provider setup and base URLs, see [Prerequisites](prerequisites.md).

## Tool calling

Tools are passed explicitly to `generate()` / `agenerate()`:

```python
from abstractcore import create_llm, tool

@tool
def get_weather(city: str) -> str:
    return f"{city}: 22°C and sunny"

llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate("Use the tool.", tools=[get_weather])
print(resp.tool_calls)
```

See **[Tool Calling](tool-calling.md)** and **[Tool Syntax Rewriting](tool-syntax-rewriting.md)**.
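
Since tool execution is pass-through by default, the host decides how returned calls are run. The sketch below shows one possible host-side dispatcher. It assumes each call is a dict with `call_id`, `name`, and `arguments` keys (the shape used in the serialized session format); the objects in `resp.tool_calls` may be shaped differently, and `TOOLS` and `execute_tool_calls` are hypothetical names.

```python
from typing import Any, Callable, Dict, List


def get_weather(city: str) -> str:
    """Plain Python stand-in for the decorated tool above."""
    return f"{city}: 22°C and sunny"


# Registry mapping tool names to callables the host is willing to run.
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def execute_tool_calls(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run each requested call and return one result record per call."""
    results = []
    for call in tool_calls:
        record = {"call_id": call.get("call_id"), "name": call["name"]}
        func = TOOLS.get(call["name"])
        if func is None:
            record.update(status="error", content=f"unknown tool: {call['name']}")
        else:
            try:
                record.update(status="ok", content=func(**call.get("arguments", {})))
            except Exception as exc:  # report the failure instead of crashing the loop
                record.update(status="error", content=str(exc))
        results.append(record)
    return results
```

Keeping an explicit registry means the model can only trigger functions the host has opted into.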

### Built-in tools (optional)

If you want a ready-made toolset (web + filesystem helpers), install:

```bash
pip install "abstractcore[tools]"
```

Then import from `abstractcore.tools.common_tools` (for example `web_search`, `skim_websearch`, `skim_url`, `fetch_url`). See **[Tool Calling](tool-calling.md)** for usage patterns and when to use `skim_*` vs `fetch_*`.

## Structured output

Pass a Pydantic model via `response_model=...` to receive a typed result:

```python
from pydantic import BaseModel
from abstractcore import create_llm

class Answer(BaseModel):
    title: str
    bullets: list[str]

llm = create_llm("openai", model="gpt-4o-mini")
result = llm.generate("Summarize HTTP/3 in 3 bullets.", response_model=Answer)
print(result.bullets)
```

See **[Structured Output](structured-output.md)**.

## Media input

Media handling is opt-in:

```bash
pip install "abstractcore[media]"
```

Then pass `media=[...]` to `generate()` / `agenerate()` (or use the media pipeline). Media behavior is **policy-driven**:

- Images: use a vision-capable model, or configure vision fallback (caption → inject short observations).
- Video: controlled by `video_policy` (native when supported; otherwise frame sampling via `ffmpeg` + vision handling).
- Audio: controlled by `audio_policy` (native when supported; otherwise optional speech-to-text via `abstractvoice`).

See **[Media Handling](media-handling-system.md)**, **[Vision Capabilities](vision-capabilities.md)**, and **[Centralized Config](centralized-config.md)**.

## HTTP API (optional)

If you want an OpenAI-compatible `/v1` gateway, install and run the server:

```bash
pip install "abstractcore[server]"
abstractcore serve
```

See **[Server](server.md)**.

---

### Inlined: `docs/session.md`

# Session Management and Serialization

AbstractCore sessions can be serialized in full: messages, metadata, tool executions, and optional analytics are all preserved.

## Overview

A **BasicSession** represents a complete conversation with an LLM, including:
- All messages with timestamps and metadata
- Tool calls and their results (inline with conversation flow)
- Session configuration and settings
- Optional analytics: summary, assessment, and extracted facts

For prompt-cache-aware long chats (reuse stable prefixes like system/tools/files), use **`CachedSession`**:
- See `docs/prompt-caching.md`.

## API Design: Two Methods for Different Purposes

The `BasicSession` provides two main methods for managing conversation history:

### `generate()` - For Normal Conversations (Recommended)
Use this for typical chat interactions where you want the LLM to respond:

```python
# Normal conversation flow
response = session.generate("What is Python?", name="alice")
# This automatically:
# 1. Adds your message to history
# 2. Calls the LLM provider
# 3. Adds the assistant's response to history
# 4. Returns a GenerateResponse object with full metadata

# Access the response data
print(f"Response: {response.content}")           # Generated text
print(f"Tokens used: {response.total_tokens}")  # Token count
print(f"Generation time: {response.gen_time}ms") # Performance metrics
```

### `add_message()` - For Manual History Management
Use this when you need fine-grained control over conversation history:

```python
# Add system messages
session.add_message('system', 'You are a helpful assistant.')

# Add messages without triggering LLM generation
session.add_message('user', 'Hello!', name='alice')
session.add_message('assistant', 'Hi there!')

# Add tool messages
session.add_message('tool', '{"result": "success"}')
```

**Key Difference**: `generate()` triggers LLM response generation, `add_message()` only adds to history.

**Parameter Consistency**: Both methods accept a `name` parameter, which maps to the `metadata.name` field in the serialization schema.

## Session Serialization

### Why Serialize Sessions?

Session serialization enables:
- **Persistence**: Save and restore conversations across application restarts
- **Portability**: Share conversations between different environments
- **Analytics**: Generate summaries, assessments, and fact extractions
- **Auditing**: Complete conversation history with tool executions
- **Memory Management**: Load partial conversation windows while preserving full history

### Serialization Format

Sessions are serialized as JSON with a versioned schema for future compatibility:

```json
{
  "schema_version": "session-archive/v1",
  "session": {
    "id": "sess_01J8...",
    "created_at": "2025-10-13T14:52:46Z",
    "provider": "openai",
    "model": "gpt-4o-mini",
    "system_prompt": "You are a helpful AI assistant.",
    "settings": { "auto_compact": true },
    
    "summary": { /* optional */ },
    "assessment": { /* optional */ },
    "facts": { /* optional */ }
  },
  "messages": [ /* complete conversation history */ ]
}
```
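
The `/* optional */` markers above are annotations; a saved file is plain JSON. As a rough sketch of how a consumer might check the envelope before reading it, the hypothetical `read_archive` helper below verifies the schema version and the two top-level keys. The sample data is made up for illustration.

```python
import json

EXPECTED_SCHEMA = "session-archive/v1"


def read_archive(text: str) -> dict:
    """Parse a serialized session and check the top-level keys shown above."""
    data = json.loads(text)
    version = data.get("schema_version")
    if version != EXPECTED_SCHEMA:
        raise ValueError(f"unsupported schema_version: {version!r}")
    for key in ("session", "messages"):
        if key not in data:
            raise ValueError(f"missing top-level key: {key}")
    return data


# Stand-in archive with only the fields this check needs.
sample = json.dumps({
    "schema_version": "session-archive/v1",
    "session": {"id": "sess_demo", "provider": "openai", "model": "gpt-4o-mini"},
    "messages": [],
})
archive = read_archive(sample)
print(archive["session"]["model"], len(archive["messages"]))  # gpt-4o-mini 0
```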

### Field Descriptions

#### Session Fields

- **`id`**: Unique session identifier for tracking and correlation
- **`created_at`**: ISO timestamp of session creation
- **`provider`**: LLM provider used (openai, anthropic, ollama, etc.)
- **`model`**: Specific model name (gpt-4o-mini, claude-haiku-4-5, etc.)
- **`model_params`**: Model parameters used (temperature, max_tokens, etc.)
- **`system_prompt`**: The system prompt that guides the assistant's behavior
- **`tool_registry`**: Available tools with their schemas (declarative, no executable code)
- **`settings`**: Session configuration (auto_compact, thresholds, etc.)

#### Optional Analytics Fields

- **`summary`** *(optional)*: Compressed representation of the entire conversation
  - `created_at`: When the summary was generated
  - `preserve_recent`: Number of recent messages preserved during compaction
  - `focus`: Summary focus (e.g., "technical decisions", "key outcomes")
  - `text`: The actual summary content
  - `metrics`: Compression statistics (tokens before/after, ratio)

- **`assessment`** *(optional)*: Quality evaluation of the entire conversation
  - `created_at`: When the assessment was generated
  - `criteria`: Evaluation criteria used (clarity, coherence, relevance, etc.)
  - `overall_score`: Numeric score (typically 1-5)
  - `judge_summary`: Brief assessment summary
  - `strengths`: List of conversation strengths
  - `actionable_feedback`: Suggestions for improvement

- **`facts`** *(optional)*: Extracted facts and knowledge from the conversation
  - `extracted_at`: When facts were extracted
  - `simple_triples`: Array of [subject, predicate, object] fact triples
  - `jsonld`: Optional JSON-LD structured data
  - `statistics`: Extraction statistics (entity count, relationship count)

#### Message Structure

Each message preserves the complete conversational flow:

```json
{
  "id": "msg_01J8...",
  "role": "user|assistant|system|tool",
  "timestamp": "2025-10-13T14:55:20.123Z",
  "content": "Message content",
  "metadata": {
    "name": "alice",
    "location": "London, UK",
    "custom_field": "value"
  }
}
```

**Message Fields:**
- **`id`**: Unique message identifier
- **`role`**: Message role (user, assistant, system, tool)
- **`timestamp`**: When the message was created (auto-generated)
- **`content`**: The actual message content
- **`metadata`**: Flexible container for additional context
  - `name`: Username (defaults to "user" for user messages)
  - `location`: Geographic or contextual location
  - Any additional custom fields

#### Tool Execution Flow

Tool calls are preserved inline with the conversation to maintain sequence:

```json
[
  {
    "role": "assistant",
    "content": "Let me read that file for you.",
    "metadata": {
      "requested_tool_calls": [
        {
          "call_id": "tc_01K",
          "name": "read_file",
          "arguments": { "path": "README.md" }
        }
      ]
    }
  },
  {
    "role": "tool",
    "content": "File contents...",
    "metadata": {
      "call_id": "tc_01K",
      "name": "read_file",
      "arguments": { "path": "README.md" },
      "status": "ok",
      "duration_ms": 120,
      "stderr": null
    }
  }
]
```

This approach:
- Preserves exact execution order
- Links tool calls to results via `call_id`
- Captures execution metadata (duration, status, errors)
- Maintains human-readable conversation flow
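
To illustrate the `call_id` link, the sketch below pairs each requested call with its tool result when reading a serialized message list. It assumes the message layout shown above; `pair_tool_calls` is a hypothetical helper, not an AbstractCore API.

```python
from typing import Any, Dict, List, Optional, Tuple

Message = Dict[str, Any]


def pair_tool_calls(messages: List[Message]) -> List[Tuple[Message, Optional[Message]]]:
    """Return (requested_call, tool_result_message_or_None) pairs in request order."""
    # Index tool result messages by the call they answer.
    results_by_id = {
        m["metadata"]["call_id"]: m
        for m in messages
        if m.get("role") == "tool" and "call_id" in m.get("metadata", {})
    }
    pairs = []
    for m in messages:
        if m.get("role") != "assistant":
            continue
        for call in m.get("metadata", {}).get("requested_tool_calls", []):
            pairs.append((call, results_by_id.get(call["call_id"])))
    return pairs
```

A `None` in the second position marks a call that was requested but never answered, which is useful when auditing a transcript.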

## Usage Examples

### Basic Session Persistence

```python
from abstractcore import BasicSession, create_llm

# Create and use session
provider = create_llm("openai", model="gpt-4o-mini")
session = BasicSession(provider, system_prompt="You are a helpful assistant.")

session.add_message('user', 'Hello!', name='alice', location='Paris')
response = session.generate('What is Python?')

# Save complete session
session.save('conversation.json')

# Load session later
loaded_session = BasicSession.load('conversation.json', provider=provider)
```

### Session with Analytics

```python
# Generate optional analytics
session.generate_summary(focus="technical discussion")
session.generate_assessment(criteria=["clarity", "completeness"])
session.extract_facts()

# Save with all analytics
session.save('analyzed_conversation.json')
```

### Memory Window Management

```python
# Get recent messages for LLM context
recent_messages = session.get_window(last_n=10)

# Get messages within time range
today_messages = session.get_window(
    since="2025-10-13T00:00:00Z",
    until="2025-10-13T23:59:59Z"
)

# Get messages with token budget
windowed = session.get_window(
    token_budget=4000,
    include_summary=True  # Prepend summary if context trimmed
)
```
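
To make the token-budget idea concrete, here is a simplified sketch of budget-based windowing. It is not how `get_window()` is implemented: `estimate_tokens` is a crude character-based heuristic and both function names are hypothetical.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def window_by_budget(messages: List[Dict[str, str]], token_budget: int) -> List[Dict[str, str]]:
    """Keep the most recent messages whose estimated tokens fit in token_budget."""
    kept: List[Dict[str, str]] = []
    used = 0
    for message in reversed(messages):        # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > token_budget:
            break                             # older messages no longer fit
        kept.append(message)
        used += cost
    kept.reverse()                            # restore chronological order
    return kept
```

Walking from the newest message backwards keeps the window contiguous, so the model never sees a reply without the turn that preceded it inside the window.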

## Schema Reference

The complete JSON schema is available at: `abstractcore/assets/session_schema.json`

This schema can be used for:
- Validation of serialized sessions
- Integration with other tools and systems
- Documentation generation
- API contract definition

## CLI Integration

The CLI provides convenient commands for session management:

```bash
# Save session with basic serialization
/session save my_conversation

# Save with optional analytics
/session save analyzed_session --summary --assessment --facts

# Load session
/session load my_conversation

# Generate analytics on demand
/facts
/judge

# Optional: persist local prompt/KV cache (MLX only)
/cache save chat_cache
/cache save chat_cache --q8
/cache load chat_cache
```

Notes:
- Prompt/KV cache save/load is currently supported only for `MLXProvider` and writes a `.safetensors` file.
- Caches are model-locked; loading a cache resets the transcript and treats the KV cache as the context source of truth.

## Best Practices

### When to Use Analytics

- **Summary**: For long conversations that need compaction
- **Assessment**: For evaluating conversation quality and outcomes
- **Facts**: For knowledge extraction and structured data needs

### Performance Considerations

- Analytics are optional and computed on-demand
- Large conversations benefit from windowing for active memory
- JSON format balances human readability with performance
- Consider compression (gzip/zstd) for very large sessions
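
As an example of the compression suggestion, the standard-library `gzip` module can wrap the JSON on disk. This is a sketch with hypothetical helper names (`save_compressed`, `load_compressed`) and a stand-in archive; in practice the dict would come from a file written by `session.save(...)`.

```python
import gzip
import json
import tempfile
from pathlib import Path


def save_compressed(archive: dict, path: Path) -> None:
    """Write the archive as gzip-compressed UTF-8 JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(archive, fh)


def load_compressed(path: Path) -> dict:
    """Read an archive written by save_compressed()."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


# Round-trip demo with a stand-in archive (hypothetical content).
demo = {
    "schema_version": "session-archive/v1",
    "session": {"id": "sess_demo"},
    "messages": [{"role": "user", "content": "hello " * 500}],
}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "conversation.json.gz"
    save_compressed(demo, target)
    restored = load_compressed(target)
    compressed_size = target.stat().st_size

print(restored == demo, compressed_size < len(json.dumps(demo)))
```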

### Security and Privacy

- Sessions contain complete conversation history
- Metadata may include sensitive information (usernames, locations)
- If your host/runtime appends tool results into the conversation, those results will be preserved (and may contain file contents, etc.)
- Store sessions securely and consider data retention policies

## Migration and Compatibility

The versioned schema (`session-archive/v1`) ensures:
- Backward compatibility with older session formats
- Forward compatibility through graceful field handling
- Safe evolution of the serialization format

Legacy sessions are automatically migrated on load, preserving all available data while upgrading to the current schema.

---

### Inlined: `docs/async-guide.md`

# Async/Await Guide

Complete guide to using async/await with AbstractCore for concurrent LLM operations.

## Overview

AbstractCore exposes `agenerate()` for async generation across providers.

- **HTTP-based providers** (OpenAI-compatible endpoints, OpenRouter, Ollama, LMStudio, vLLM, etc.) implement native async I/O.
- **In-process local inference** providers (MLX, HuggingFace) use an `asyncio.to_thread()` fallback to avoid blocking the event loop.

Concurrency can improve throughput when requests are **I/O-bound** (network calls). For local inference, throughput is limited by your hardware and the model runtime.

## Provider support

| Provider | Async implementation |
|----------|----------------------|
| `openai`, `anthropic` | Native async SDK clients (when installed) |
| HTTP-based providers (`ollama`, `lmstudio`, `openrouter`, `vllm`, `openai-compatible`, …) | `httpx.AsyncClient` (native async HTTP) |
| `mlx`, `huggingface` | `asyncio.to_thread()` fallback (keeps the event loop responsive) |
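
The thread fallback in the last row can be illustrated without any provider. In this sketch, `blocking_inference` is a stand-in for a synchronous model call and `heartbeat` is a hypothetical coroutine that shows the event loop still being scheduled while the blocking work runs in a worker thread.

```python
import asyncio
import time


def blocking_inference(prompt: str) -> str:
    """Stand-in for a CPU/GPU-bound, synchronous model call."""
    time.sleep(0.2)
    return f"echo: {prompt}"


async def heartbeat(stop: asyncio.Event, ticks: list) -> None:
    """Record a tick every 10 ms to show the event loop is still running."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main() -> tuple:
    stop = asyncio.Event()
    ticks: list = []
    beat = asyncio.create_task(heartbeat(stop, ticks))
    # The blocking call runs in a worker thread; the loop keeps scheduling heartbeat().
    result = await asyncio.to_thread(blocking_inference, "hello")
    stop.set()
    await beat
    return result, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling `blocking_inference("hello")` directly inside `main()` would freeze the loop for the full 0.2 seconds and the heartbeat would record almost no ticks.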

## Basic Usage

### Single Async Request

```python
import asyncio
from abstractcore import create_llm

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")

    # Single async request
    response = await llm.agenerate("What is Python?")
    print(response.content)

asyncio.run(main())
```

### Concurrent Requests

```python
import asyncio
from abstractcore import create_llm

async def main():
    llm = create_llm("ollama", model="qwen3:4b")

    topics = ["Python", "JavaScript", "Rust"]

    # Create one coroutine per topic
    tasks = [llm.agenerate(f"Summarize {topic}") for topic in topics]

    # gather() runs all requests concurrently and returns results in input order
    responses = await asyncio.gather(*tasks)

    for topic, response in zip(topics, responses):
        print(f"\n{topic}:")
        print(response.content)

asyncio.run(main())
```
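
With many prompts, an unbounded `gather()` can overwhelm a local server or trip provider rate limits. One way to cap the number of in-flight requests is an `asyncio.Semaphore`. The sketch below uses a hypothetical `gather_limited` helper and a `fake_worker` stand-in so it runs without a provider.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(prompts: List[str],
                         worker: Callable[[str], Awaitable[str]],
                         limit: int = 2) -> List[str]:
    """Run worker(prompt) for every prompt with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(prompt: str) -> str:
        async with semaphore:
            return await worker(prompt)

    # gather() preserves input order in its result list.
    return await asyncio.gather(*(run_one(p) for p in prompts))


async def fake_worker(prompt: str) -> str:
    """Stand-in for `llm.agenerate(prompt)`; sleeps instead of calling a provider."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    print(asyncio.run(gather_limited(["python", "javascript", "rust"], fake_worker)))
```

In real use, `worker` would wrap `llm.agenerate(...)` and the results would be response objects rather than strings.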

## Async Streaming

### Basic Streaming

```python
import asyncio
from abstractcore import create_llm

async def main():
    llm = create_llm("anthropic", model="claude-haiku-4-5")

    # Step 1: await the generator
    stream_gen = await llm.agenerate(
        "Write a haiku about coding",
        stream=True
    )

    # Step 2: async for over the chunks
    async for chunk in stream_gen:
        if chunk.content:
            print(chunk.content, end="", flush=True)
    print()

asyncio.run(main())
```

### Concurrent Streaming

```python
import asyncio
from abstractcore import create_llm

async def stream_response(llm, topic, label):
    """Stream a single response with label."""
    print(f"\n{label}:")

    stream_gen = await llm.agenerate(f"Explain {topic} in one sentence", stream=True)

    async for chunk in stream_gen:
        if chunk.content:
            print(chunk.content, end="", flush=True)
    print()

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")

    # Stream 3 responses concurrently
    await asyncio.gather(
        stream_response(llm, "Python", "Python"),
        stream_response(llm, "JavaScript", "JavaScript"),
        stream_response(llm, "Rust", "Rust")
    )

asyncio.run(main())
```

## Session Async

### Async Conversation Management

```python
import asyncio
from abstractcore import create_llm
from abstractcore.core.session import BasicSession

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")
    session = BasicSession(provider=llm)

    # Maintain conversation history with async
    response1 = await session.agenerate("What is Python?")
    print(response1.content)

    response2 = await session.agenerate("What are its main use cases?")
    print(response2.content)

    # Session tracks full conversation history
    print(f"\nConversation length: {len(session.conversation_history)} messages")

asyncio.run(main())
```

### Concurrent Sessions

```python
import asyncio
from abstractcore import create_llm
from abstractcore.core.session import BasicSession

async def chat_session(llm, topic, name):
    """Run independent chat session."""
    session = BasicSession(provider=llm)

    response1 = await session.agenerate(f"What is {topic}?")
    response2 = await session.agenerate("Give me a simple example")

    print(f"\n{name}:")
    print(f"  Question 1: {response1.content[:50]}...")
    print(f"  Question 2: {response2.content[:50]}...")

async def main():
    llm = create_llm("anthropic", model="claude-haiku-4-5")

    # Run 3 independent conversations concurrently
    await asyncio.gather(
        chat_session(llm, "Python", "Session 1"),
        chat_session(llm, "JavaScript", "Session 2"),
        chat_session(llm, "Rust", "Session 3")
    )

asyncio.run(main())
```

## Multi-Provider Comparisons

### Concurrent Provider Queries

```python
import asyncio
from abstractcore import create_llm

async def query_provider(provider_name, model, prompt):
    """Query a single provider."""
    llm = create_llm(provider_name, model=model)
    response = await llm.agenerate(prompt)
    return {
        "provider": provider_name,
        "model": model,
        "response": response.content
    }

async def main():
    prompt = "What is the capital of France?"

    # Query multiple providers simultaneously
    results = await asyncio.gather(
        query_provider("openai", "gpt-4o-mini", prompt),
        query_provider("anthropic", "claude-haiku-4-5", prompt),
        query_provider("ollama", "qwen3:4b", prompt)
    )

    for result in results:
        print(f"\n{result['provider']} ({result['model']}):")
        print(result['response'])

asyncio.run(main())
```

### Provider Consensus

```python
import asyncio
from abstractcore import create_llm

async def main():
    prompt = "Is the Earth flat? Answer yes or no."

    # Get consensus from 3 providers
    llm_openai = create_llm("openai", model="gpt-4o-mini")
    llm_anthropic = create_llm("anthropic", model="claude-haiku-4-5")
    llm_ollama = create_llm("ollama", model="qwen3:4b")

    responses = await asyncio.gather(
        llm_openai.agenerate(prompt),
        llm_anthropic.agenerate(prompt),
        llm_ollama.agenerate(prompt)
    )

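    # Normalize: lowercase and strip whitespace and trailing punctuation ("No." -> "no")
    answers = [r.content.strip().lower().rstrip(".!") for r in responses]
    print(f"Answers: {answers}")
    # "no" is the expected answer; report whether at least 2 of 3 providers agree on it
    print(f"Consensus on 'no': {'Yes' if answers.count('no') >= 2 else 'No'}")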

asyncio.run(main())
```

## FastAPI Integration

### Async HTTP Endpoints

```python
import asyncio

from fastapi import FastAPI
from abstractcore import create_llm

app = FastAPI()
llm = create_llm("openai", model="gpt-4o-mini")

@app.post("/generate")
async def generate(prompt: str):
    """Non-blocking LLM generation endpoint."""
    response = await llm.agenerate(prompt)
    return {"response": response.content}

@app.post("/batch")
async def batch_generate(prompts: list[str]):
    """Process multiple prompts concurrently."""
    tasks = [llm.agenerate(p) for p in prompts]
    responses = await asyncio.gather(*tasks)

    return {
        "responses": [r.content for r in responses]
    }

# Run with: uvicorn your_app:app --reload
```

### Streaming Endpoint

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from abstractcore import create_llm
import asyncio

app = FastAPI()
llm = create_llm("anthropic", model="claude-haiku-4-5")

async def stream_response(prompt: str):
    """Generate streaming response."""
    stream_gen = await llm.agenerate(prompt, stream=True)

    async for chunk in stream_gen:
        if chunk.content:
            yield f"data: {chunk.content}\n\n"

@app.post("/stream")
async def stream_generate(prompt: str):
    """Streaming LLM generation endpoint."""
    return StreamingResponse(
        stream_response(prompt),
        media_type="text/event-stream"
    )
```
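In the Server-Sent Events format, a payload that contains newlines has to be sent as several `data:` lines, otherwise a blank line inside a chunk would end the event early. The sketch below shows one way to frame chunks; `format_sse` is a hypothetical helper, not an AbstractCore or FastAPI function:

```python
def format_sse(data: str) -> str:
    """Frame a text payload as a single Server-Sent Event."""
    # Normalize line endings, then emit one "data:" line per payload line
    lines = data.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    # A blank line terminates the event
    return "".join(f"data: {line}\n" for line in lines) + "\n"

print(repr(format_sse("Hello\nworld")))
```

Inside `stream_response`, the `yield` line could then become `yield format_sse(chunk.content)`.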

## Batch Document Processing

### Concurrent Document Summaries

```python
import asyncio
from abstractcore import create_llm
from abstractcore.processing import Summarizer

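async def summarize_document(summarizer, doc_path):
    """Summarize single document."""
    # summarize() is a synchronous call: run it in a worker thread so the
    # documents are actually processed concurrently
    result = await asyncio.to_thread(
        summarizer.summarize,
        input_source=doc_path,
        style="executive",
        length="brief"
    )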
    return {
        "path": doc_path,
        "summary": result.summary
    }

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")
    summarizer = Summarizer(llm)

    documents = [
        "report1.pdf",
        "report2.pdf",
        "report3.pdf"
    ]

    # Summarize all documents concurrently
    tasks = [summarize_document(summarizer, doc) for doc in documents]
    results = await asyncio.gather(*tasks)

    for result in results:
        print(f"\n{result['path']}:")
        print(result['summary'])

asyncio.run(main())
```

## Error Handling

### Graceful Error Recovery

```python
import asyncio
from abstractcore import create_llm
from abstractcore.exceptions import ProviderAPIError

async def safe_generate(llm, prompt, label):
    """Generate with error handling."""
    try:
        response = await llm.agenerate(prompt)
        return {"label": label, "content": response.content, "error": None}
    except ProviderAPIError as e:
        return {"label": label, "content": None, "error": str(e)}

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")

    # Some requests may fail - continue processing others
    results = await asyncio.gather(
        safe_generate(llm, "Valid prompt 1", "Task 1"),
        safe_generate(llm, "Valid prompt 2", "Task 2"),
        safe_generate(llm, "Valid prompt 3", "Task 3")
    )

    for result in results:
        if result["error"]:
            print(f"{result['label']}: ERROR - {result['error']}")
        else:
            print(f"{result['label']}: {result['content']}")

asyncio.run(main())
```
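An alternative to wrapping each call in `try`/`except` is `asyncio.gather(..., return_exceptions=True)`, which returns raised exceptions in place of results and keeps the input order. A minimal sketch, where `fake_generate` is a hypothetical stand-in for `llm.agenerate()` so the example runs without a provider:

```python
import asyncio

async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for `llm.agenerate(prompt)`; fails on one prompt."""
    await asyncio.sleep(0.01)
    if "bad" in prompt:
        raise RuntimeError(f"provider rejected: {prompt}")
    return f"ok: {prompt}"

async def run_batch(prompts):
    # Exceptions are returned as values instead of cancelling the whole batch
    return await asyncio.gather(
        *(fake_generate(p) for p in prompts),
        return_exceptions=True,
    )

results = asyncio.run(run_batch(["first", "bad prompt", "third"]))
for r in results:
    print("ERROR:" if isinstance(r, Exception) else "OK:", r)
```

The per-task wrapper above is still the better fit when each failure needs a label or custom handling.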

## Practical tips

### 1. Prefer native-async providers when possible

```python
# ✅ Native async HTTP (I/O-bound)
llm = create_llm("ollama", model="qwen3:4b")

# ✅ Native async SDK (cloud APIs)
llm = create_llm("openai", model="gpt-4o-mini")

# ⚠️ Fallback: runs sync generation in a thread (keeps the event loop responsive)
llm = create_llm("mlx", model="mlx-community/Qwen3-4B-4bit")
```

### 2. Batch Similar Operations

```python
# ✅ GOOD: Single gather for all tasks
tasks = [llm.agenerate(f"Task {i}") for i in range(10)]
results = await asyncio.gather(*tasks)

# ❌ BAD: Sequential awaits lose concurrency benefit
results = []
for i in range(10):
    result = await llm.agenerate(f"Task {i}")
    results.append(result)
```
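A single `gather()` starts every request at once, which can overwhelm a local server or hit provider limits on large batches. The sketch below caps in-flight requests with `asyncio.Semaphore`; `fake_generate` and `gather_bounded` are hypothetical names, and the peak counter exists only to make the effect visible:

```python
import asyncio

async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for `llm.agenerate(prompt)`."""
    await asyncio.sleep(0.01)
    return f"done: {prompt}"

async def gather_bounded(prompts, limit=3):
    """Run all prompts concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(prompt):
        nonlocal in_flight, peak
        async with semaphore:  # waits while `limit` workers are active
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_generate(prompt)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(p) for p in prompts))
    return results, peak

results, peak = asyncio.run(gather_bounded([f"Task {i}" for i in range(10)], limit=3))
print(len(results), "results, peak concurrency:", peak)
```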

### 3. Mix Async with Sync I/O

```python
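import asyncio
from abstractcore import create_llm

def read_large_file(path):
    """Blocking file read (runs in a worker thread below)."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

async def main():
    llm = create_llm("anthropic", model="claude-haiku-4-5")

    # Concurrent: LLM generation + blocking file I/O in a thread
    llm_task = llm.agenerate("Explain async")
    file_task = asyncio.to_thread(read_large_file, "data.txt")

    response, data = await asyncio.gather(llm_task, file_task)
    # Both completed concurrently!

asyncio.run(main())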
```

## Common Patterns

### Retry with Exponential Backoff

```python
import asyncio
from abstractcore import create_llm

async def generate_with_retry(llm, prompt, max_retries=3):
    """Generate with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            return await llm.agenerate(prompt)
        except Exception as e:
            if attempt == max_retries - 1:
                raise

            wait_time = 2 ** attempt
            print(f"Retry {attempt + 1} after {wait_time}s...")
            await asyncio.sleep(wait_time)

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")
    response = await generate_with_retry(llm, "What is Python?")
    print(response.content)

asyncio.run(main())
```
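Retrying on every `Exception` also retries errors that will never succeed (for example, an invalid request), and fixed delays can make many clients retry in lockstep. The following sketch retries only selected exception types and adds random jitter; `TransientError`, `retry_async`, and `flaky_call` are hypothetical names, and which AbstractCore exceptions count as transient is left as an assumption for the caller to decide:

```python
import asyncio
import random

class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limits)."""

async def retry_async(make_call, max_retries=3, base_delay=0.01, retry_on=(TransientError,)):
    """Await `make_call()` until it succeeds or retries are exhausted."""
    for attempt in range(max_retries):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff plus random jitter
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay))

calls = {"count": 0}

async def flaky_call():
    """Fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "success"

result = asyncio.run(retry_async(flaky_call))
print(result, "after", calls["count"], "attempts")
```

With a real provider, `make_call` could be `lambda: llm.agenerate(prompt)`.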

### Rate Limiting

```python
import asyncio
from abstractcore import create_llm

class RateLimiter:
    """Allow at most `max_per_second` acquisitions in any 1-second window."""

    def __init__(self, max_per_second):
        self.max_per_second = max_per_second
        self.semaphore = asyncio.Semaphore(max_per_second)

    async def acquire(self):
        await self.semaphore.acquire()

        # Return this permit after 1 second (one scheduled release per acquire,
        # so permits are never lost)
        asyncio.get_running_loop().call_later(1.0, self.semaphore.release)

async def main():
    llm = create_llm("openai", model="gpt-4o-mini")
    limiter = RateLimiter(max_per_second=5)

    # Process 20 requests with 5 requests/second limit
    async def limited_generate(prompt):
        await limiter.acquire()
        return await llm.agenerate(prompt)

    tasks = [limited_generate(f"Task {i}") for i in range(20)]
    results = await asyncio.gather(*tasks)

asyncio.run(main())
```

### Progress Tracking

```python
import asyncio
from abstractcore import create_llm

async def generate_with_progress(llm, prompts):
    """Generate with real-time progress tracking."""
    completed = 0
    total = len(prompts)

    async def track_task(prompt):
        nonlocal completed
        response = await llm.agenerate(prompt)
        completed += 1
        print(f"Progress: {completed}/{total} ({completed/total*100:.1f}%)")
        return response

    tasks = [track_task(p) for p in prompts]
    return await asyncio.gather(*tasks)

async def main():
    llm = create_llm("ollama", model="qwen3:4b")
    prompts = [f"Task {i}" for i in range(10)]

    results = await generate_with_progress(llm, prompts)
    print(f"\nCompleted {len(results)} tasks!")

asyncio.run(main())
```

## Why MLX/HuggingFace Use Fallback

MLX and HuggingFace providers fall back to `asyncio.to_thread()` because:

1. **No Async Library APIs**: Neither `mlx_lm` nor `transformers` exposes an async Python API
2. **Direct Function Calls**: Inference runs in-process, so there is no HTTP layer where concurrent I/O could happen
3. **Industry Standard**: Same pattern used by LangChain, Pydantic-AI for CPU-bound operations
4. **Event Loop Responsive**: The blocking call runs in a worker thread, so the event loop stays free for other I/O

```python
# MLX/HF async example (fallback keeps event loop responsive)
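import asyncio
from abstractcore import create_llm

async def fetch_data_from_api():
    """Placeholder for any async I/O call (HTTP request, DB query, ...)."""
    await asyncio.sleep(0.1)
    return {"status": "ok"}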

async def main():
    llm = create_llm("mlx", model="mlx-community/Qwen3-4B-4bit")

    # Can mix MLX inference with async I/O
    inference_task = llm.agenerate("What is Python?")
    io_task = fetch_data_from_api()  # Async I/O

    # Both run concurrently - event loop not blocked!
    response, data = await asyncio.gather(inference_task, io_task)

asyncio.run(main())
```
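The effect of the thread fallback can be observed without any model installed. In this sketch, `blocking_inference` is a hypothetical stand-in for a synchronous inference call, and a heartbeat task shows that the event loop keeps running while the blocking call executes in a worker thread:

```python
import asyncio
import time

def blocking_inference(prompt: str) -> str:
    """Hypothetical stand-in for a synchronous, blocking model call."""
    time.sleep(0.2)
    return f"answer to: {prompt}"

async def heartbeat(ticks: list, stop: asyncio.Event):
    """Record a timestamp every 20 ms while the loop is free to run."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)

async def main():
    ticks = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    # The blocking call runs in a worker thread; the loop keeps ticking
    answer = await asyncio.to_thread(blocking_inference, "What is Python?")
    stop.set()
    await beat
    return answer, len(ticks)

answer, tick_count = asyncio.run(main())
print(answer, "| heartbeat ticks during inference:", tick_count)
```

Calling `blocking_inference(...)` directly inside `main()` instead would freeze the heartbeat until the call returned.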

If you run local inference behind an OpenAI-compatible HTTP server (for example, via LM Studio), you can use the `lmstudio` (or `openai-compatible`) provider for native async I/O to the server:

```python
llm = create_llm("lmstudio", model="local-model", base_url="http://localhost:1234/v1")
```

## Best Practices

### 1. Always Use asyncio.gather() for Concurrent Tasks

```python
# ✅ CORRECT: All tasks run concurrently
results = await asyncio.gather(*[llm.agenerate(p) for p in prompts])

# ❌ WRONG: Sequential execution (no concurrency)
results = [await llm.agenerate(p) for p in prompts]
```

### 2. Await Stream Generator First

```python
# ✅ CORRECT: Two-step pattern
stream_gen = await llm.agenerate(prompt, stream=True)
async for chunk in stream_gen:
    print(chunk.content, end="")

# ❌ WRONG: Missing await before async for
async for chunk in llm.agenerate(prompt, stream=True):  # Error!
    print(chunk.content, end="")
```

### 3. Close Resources Properly

```python
# ✅ GOOD: Clean shutdown
llm = create_llm("openai", model="gpt-4o-mini")
try:
    response = await llm.agenerate("Test")
finally:
    llm.unload_model(llm.model)  # Closes async client
```

### 4. Handle Errors in Concurrent Operations

```python
# ✅ GOOD: Catch errors per-task
async def safe_task(prompt):
    try:
        return await llm.agenerate(prompt)
    except Exception as e:
        return f"Error: {e}"

results = await asyncio.gather(*[safe_task(p) for p in prompts])
```

## Learning Resources

- **Educational Demo**: `examples/cli/async_cli_demo.py` - 8 core async/await patterns
- **Test Suite**: `tests/async/test_async_providers.py` - real implementation examples
- **Concurrency & Throughput**: `docs/concurrency.md` - practical guidance for local inference

## Summary

- ✅ `agenerate()` works across providers
- ✅ Use `asyncio.gather()` for concurrent (I/O-bound) requests
- ✅ HTTP-based providers use native async; MLX/HuggingFace use a thread fallback to keep the event loop responsive
- ✅ Async streaming uses a 2-step pattern: `stream_gen = await llm.agenerate(..., stream=True)` then `async for ...`
- ✅ Works well in FastAPI and other async frameworks

**Get Started**:
```bash
pip install abstractcore

# Try the educational async demo
python examples/cli/async_cli_demo.py --provider ollama --model qwen3:4b
```

---

### Inlined: `docs/tool-calling.md`

# Tool Calling System

AbstractCore provides a universal tool calling system that works across all LLM providers, even those without native tool support.

## Table of Contents

- [Quick Start](#quick-start)
- [The @tool Decorator](#the-tool-decorator)
- [Universal Tool Support](#universal-tool-support)
- [Tool Definition](#tool-definition)
- [Tool Execution](#tool-execution)
- [Advanced Patterns](#advanced-patterns)
- [Error Handling](#error-handling)
- [Performance Optimization](#performance-optimization)
- [Tool Syntax Rewriting](#tool-syntax-rewriting)
- [Event System Integration](#event-system-integration)
- [Best Practices](#best-practices)

## Quick Start

The simplest way to create and use tools is with the `@tool` decorator:

```python
from abstractcore import create_llm, tool

@tool
def get_weather(city: str) -> str:
    """Get current weather for a specified city."""
    # In a real scenario, you'd call an actual weather API
    return f"The weather in {city} is sunny, 72°F"

@tool
def calculate(expression: str) -> float:
    """Perform a mathematical calculation."""
    try:
        result = eval(expression)  # Simplified for demo - don't use eval in production!
        return result
    except Exception:
        return float('nan')

# Works with ANY provider
llm = create_llm("openai", model="gpt-4o-mini")
response = llm.generate(
    "What's the weather in Tokyo and what's 15 * 23?",
    tools=[get_weather, calculate]  # Pass functions directly
)

print(response.content)

# By default (`execute_tools=False`), AbstractCore does not execute tools.
# Instead, it returns structured tool calls (if the model chose to call tools).
print(f"Tool calls requested: {len(response.tool_calls) if response.tool_calls else 0}")
print(f"Generation time: {response.gen_time}ms")
print(f"Summary: {response.get_summary()}")  # Includes tool count

# Inspect tool calls (host/runtime executes them)
if response.tool_calls:
    for call in response.tool_calls:
        print(f"Tool: {call.get('name')} args={call.get('arguments')}")
```
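The `calculate` tool above uses `eval` for brevity. One possible safer variant walks the parsed expression with the standard `ast` module and allows only numeric literals and basic arithmetic; `safe_calculate` is a hypothetical helper for illustration, not part of AbstractCore:

```python
import ast
import operator

# Only these operators are permitted
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def safe_calculate(expression: str) -> float:
    """Evaluate +, -, *, /, % over numeric literals; reject everything else."""
    def evaluate(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)

print(safe_calculate("15 * 23"))
```

Names, function calls, and attribute access all raise `ValueError`, so an input such as `__import__('os')` is rejected rather than executed.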

## The @tool Decorator

The `@tool` decorator is the primary way to create tools in AbstractCore. It automatically extracts function metadata and creates proper tool definitions.

### Basic Usage

```python
from abstractcore import tool

@tool
def list_files(directory: str = ".", pattern: str = "*") -> str:
    """List files in a directory matching a pattern."""
    import os
    import fnmatch
    
    try:
        files = [f for f in os.listdir(directory) 
                if fnmatch.fnmatch(f, pattern)]
        return "\n".join(files) if files else "No files found"
    except Exception as e:
        return f"Error: {str(e)}"
```

### Type Annotations

The decorator automatically infers parameter types from type annotations:

```python
@tool
def create_user(name: str, age: int, is_admin: bool = False) -> str:
    """Create a new user with the specified details."""
    user_data = {
        "name": name,
        "age": age,
        "is_admin": is_admin,
        "created_at": "2025-01-14"
    }
    return f"Created user: {user_data}"
```

### Enhanced Metadata

The `@tool` decorator supports rich metadata that gets automatically injected into system prompts for prompted models and passed to native APIs:

```python
@tool(
    description="Search the database for records matching the query",
    tags=["database", "search", "query"],
    when_to_use="When the user asks for specific data from the database or wants to find records",
    examples=[
        {
            "description": "Find all users named John",
            "arguments": {
                "query": "name=John",
                "table": "users"
            }
        },
        {
            "description": "Search for products under $50",
            "arguments": {
                "query": "price<50", 
                "table": "products"
            }
        },
        {
            "description": "Find recent orders",
            "arguments": {
                "query": "date>2025-01-01",
                "table": "orders"
            }
        }
    ]
)
def search_database(query: str, table: str = "users") -> str:
    """Search the database for records matching the query."""
    # Implementation here
    return f"Searching {table} for: {query}"
```

**How This Metadata is Used:**
- **Prompted tool calling**: the tool formatter injects tool name/description/args into the system prompt. To keep prompts small, `when_to_use` is included only for small tool sets and a few high-impact tools (edit/write/execute + web triage tools), and tool examples are globally capped.
- **Native tool calling**: only standard fields (`name`, `description`, `parameters`) are sent to provider APIs (unknown/custom fields are intentionally omitted for compatibility).

### Built-in Tools

AbstractCore includes a comprehensive set of ready-to-use tools in `abstractcore.tools.common_tools` (requires `pip install "abstractcore[tools]"`):

```python
from abstractcore.tools.common_tools import skim_url, fetch_url, search_files, read_file, list_files

# Quick URL preview (fast, small)
preview = skim_url("https://example.com/article")

# Full web content fetching and parsing (HTML→Markdown, JSON/XML/text)
result = fetch_url("https://api.github.com/repos/python/cpython")
# For PDFs/images/other binaries, fetch_url returns metadata (and optional previews), not full extraction.

# File system operations  
files = search_files("def.*fetch", ".", file_pattern="*.py")
content = read_file("config.json")
directory_listing = list_files(".", pattern="*.py", recursive=True)
```

**Available Built-in Tools:**
- `skim_url` - Fast URL skim (title/description/headings + short preview)
- `fetch_url` - Fetch + parse common text-first types (HTML→Markdown, JSON/XML/text); binaries return metadata + optional previews
- `search_files` - Search for text patterns inside files using regex
- `list_files` - Find and list files by names/paths using glob patterns
- `read_file` - Read file contents with optional line range selection
- `write_file` - Write content to files with directory creation
- `edit_file` - Edit files using pattern matching and replacement
- `web_search` - Search the web using DuckDuckGo
- `skim_websearch` - Smaller/filtered web search (compact result list)
- `execute_command` - Execute shell commands safely with security controls

**Suggested web workflow (agent-friendly):**
1. `skim_websearch(...)` → get a small set of candidate URLs
2. `skim_url(...)` → quickly decide what’s worth fetching
3. `fetch_url(...)` → parse the selected URL(s); set `include_full_content=False` when you want a smaller output

Tip: you can measure output footprint/latency locally with `python examples/tools/skim_tools_benchmark.py --help`.

### Real-World Example

Here's an example from AbstractCore's codebase showing the enhanced `@tool` decorator:

```python
from typing import Optional

from abstractcore import tool

@tool(
    description="Find and list files and directories by their names/paths using glob patterns (case-insensitive, supports multiple patterns)",
    tags=["file", "directory", "listing", "filesystem"],
    when_to_use="When you need to find files by their names, paths, or file extensions (NOT for searching file contents)",
    examples=[
        {
            "description": "List all files in current directory",
            "arguments": {
                "directory_path": ".",
                "pattern": "*"
            }
        },
        {
            "description": "Find all Python files recursively",
            "arguments": {
                "directory_path": ".",
                "pattern": "*.py",
                "recursive": True
            }
        },
        {
            "description": "Find all files with 'test' in filename (case-insensitive)",
            "arguments": {
                "directory_path": ".",
                "pattern": "*test*",
                "recursive": True
            }
        },
        {
            "description": "Find multiple file types using | separator",
            "arguments": {
                "directory_path": ".",
                "pattern": "*.py|*.js|*.md",
                "recursive": True
            }
        },
        {
            "description": "Complex multiple patterns - documentation, tests, and config files",
            "arguments": {
                "directory_path": ".",
                "pattern": "README*|*test*|config.*|*.yml",
                "recursive": True
            }
        }
    ]
)
def list_files(directory_path: str = ".", pattern: str = "*", recursive: bool = False, include_hidden: bool = False, head_limit: Optional[int] = 50) -> str:
    """
    List files and directories in a specified directory with pattern matching (case-insensitive).

    IMPORTANT: Use 'directory_path' parameter (not 'file_path') to specify the directory to list.

    Args:
        directory_path: Path to the directory to list files from (default: "." for current directory)
        pattern: Glob pattern(s) to match files. Use "|" to separate multiple patterns (default: "*")
        recursive: Whether to search recursively in subdirectories (default: False)
        include_hidden: Whether to include hidden files/directories starting with '.' (default: False)
        head_limit: Maximum number of files to return (default: 50, None for unlimited)

    Returns:
        Formatted string with file and directory listings or error message.
        When head_limit is applied, shows "showing X of Y files" in the header.
    """
    # Implementation here...
```

**How This Gets Transformed**

When you use this tool with a model that relies on prompted tool calling (for example, a model served through Ollama), AbstractCore automatically generates a system prompt like this:

```
You are a helpful AI assistant with access to the following tools:

**list_files**: Find and list files and directories by their names/paths using glob patterns (case-insensitive, supports multiple patterns)
• When to use: When you need to find files by their names, paths, or file extensions (NOT for searching file contents)
• Tags: file, directory, listing, filesystem
• Parameters: {"directory_path": {"type": "string", "default": "."}, "pattern": {"type": "string", "default": "*"}, ...}

To use a tool, respond with this EXACT format:
<|tool_call|>
{"name": "tool_name", "arguments": {"param1": "value1", "param2": "value2"}}
</|tool_call|>

**EXAMPLES:**

**list_files Examples:**
1. List all files in current directory:
<|tool_call|>
{"name": "list_files", "arguments": {"directory_path": ".", "pattern": "*"}}
</|tool_call|>

2. Find all Python files recursively:
<|tool_call|>
{"name": "list_files", "arguments": {"directory_path": ".", "pattern": "*.py", "recursive": true}}
</|tool_call|>

... and 3 more examples with proper formatting ...
```

## Universal Tool Support

AbstractCore's tool system works across all providers through two mechanisms:

### Control Tokens vs Tool Transcript Tags (Important)

It’s easy to conflate two separate layers:

1) **Chat-template control tokens** (provider responsibility)
   - These are the hidden/model-specific role separators that turn `{role:"system"}` vs `{role:"user"}` into the model’s expected prompt template.
   - Examples (model-dependent): Llama role headers, Qwen `im_start` blocks, etc.
   - When you use a messages API (OpenAI-compatible, Anthropic, Ollama, LMStudio), the server usually applies these automatically.

2) **Tool-call transcript tags** (prompted strategy)
   - These are literal strings the model emits in *assistant content* that we parse, such as:
     - Qwen-style: `<|tool_call|>…</|tool_call|>`
     - LLaMA-style: `<function_call>…</function_call>`
     - XML-ish: `<tool_call>…</tool_call>`
   - They may correspond to special tokens in some tokenizers, but in prompted mode we still treat them as transcript text and parse them from the output.

Native tool calling uses structured request/response fields (`tools` / `tool_calls` / Anthropic `tool_use`) and relies on the provider/server to apply the correct chat template; prompted tool calling describes tools in the system prompt and expects transcript tags in assistant text.

### 1. Native Tool Support

For providers with native tool APIs (OpenAI, Anthropic):

```python
# OpenAI with native tool support
llm = create_llm("openai", model="gpt-4o-mini")
response = llm.generate("What's the weather?", tools=[get_weather])
```

### 2. Intelligent Prompting

For providers without native tool support (Ollama, MLX, LMStudio):

```python
# Ollama without native tool support - AbstractCore handles this automatically
llm = create_llm("ollama", model="qwen3:4b-instruct")
response = llm.generate("What's the weather?", tools=[get_weather])
# AbstractCore automatically:
# 1. Detects the model architecture (Qwen3)
# 2. Formats tools with examples into system prompt
# 3. Parses tool calls from response using <|tool_call|> format
# 4. Returns structured tool call requests in response.tool_calls
```

## Tool Definition

Tools are defined using the `ToolDefinition` class, but the `@tool` decorator handles this automatically:

```python
from abstractcore.tools import ToolDefinition

# Manual tool definition (rarely needed)
tool_def = ToolDefinition(
    name="get_weather",
    description="Get current weather for a city",
    parameters={
        "city": {
            "type": "string",
            "description": "The city name"
        }
    },
    function=get_weather_function
)
```

### Parameter Types

Supported parameter types:

- `string` - Text values
- `integer` - Whole numbers
- `number` - Floating-point numbers
- `boolean` - True/false values
- `array` - Lists of values
- `object` - Complex nested structures

```python
@tool
def complex_tool(
    text: str,
    count: int,
    threshold: float,
    enabled: bool,
    tags: list,
    config: dict
) -> str:
    """Tool with various parameter types."""
    return f"Processed: {text} with {count} items"
```
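To make the mapping between Python annotations and the parameter types above concrete, here is an illustrative sketch built on `inspect` and `typing.get_type_hints`. It is not AbstractCore's actual extraction logic; `JSON_TYPES` and `describe_parameters` are hypothetical names, and falling back to `string` for unknown annotations is an assumption of this sketch:

```python
import inspect
from typing import get_type_hints

# Python annotation -> parameter type name, following the list above
JSON_TYPES = {
    str: "string",
    int: "integer",
    float: "number",
    bool: "boolean",
    list: "array",
    dict: "object",
}

def describe_parameters(func):
    """Build a {name: {"type": ..., "default": ...}} mapping from a signature."""
    hints = get_type_hints(func)
    params = {}
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string" (assumption)
        entry = {"type": JSON_TYPES.get(hints.get(name), "string")}
        if param.default is not inspect.Parameter.empty:
            entry["default"] = param.default
        params[name] = entry
    return params

def create_user(name: str, age: int, is_admin: bool = False) -> str:
    return f"Created user: {name}"

print(describe_parameters(create_user))
```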

## Tool Execution

### Execution Modes

- **Passthrough mode (default)**: `execute_tools=False`
  - AbstractCore returns structured tool calls in `GenerateResponse.tool_calls`.
  - By default (`tool_call_tags is None`), tool-call markup is stripped from `GenerateResponse.content`.
  - A host/runtime executes tools (recommended for servers and agent loops).

- **Direct execution mode (deprecated)**: `execute_tools=True`
  - AbstractCore parses and executes tools locally via the tool registry and appends results to `content`.
  - Intended for simple scripts only; avoid in server/untrusted environments.

### Architecture-Aware Tool Call Detection

AbstractCore automatically detects model architecture and uses the appropriate tool call format:

| Architecture | Format | Example |
|-------------|--------|---------|
| **Qwen3** | `<\|tool_call\|>...JSON...</\|tool_call\|>` | `<\|tool_call\|>{"name": "get_weather", "arguments": {"city": "Paris"}}</\|tool_call\|>` |
| **LLaMA3** | `<function_call>...JSON...</function_call>` | `<function_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</function_call>` |
| **OpenAI/Anthropic** | Native API tool calls | Structured JSON in API response |
| **XML-based** | `<tool_call>...JSON...</tool_call>` | `<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>` |

**Note:** AbstractCore handles architecture detection, prompt formatting, and response parsing automatically. Your tools work the same way across all providers.
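For intuition, the transcript-tag formats in the table can be parsed with a few regular expressions. This is a simplified sketch, not AbstractCore's parser; `TAG_PATTERNS` and `extract_tool_calls` are hypothetical names, and silently skipping malformed JSON is a choice made for this example:

```python
import json
import re

# One pattern per transcript-tag style from the table above
TAG_PATTERNS = [
    re.compile(r"<\|tool_call\|>(.*?)</\|tool_call\|>", re.DOTALL),
    re.compile(r"<function_call>(.*?)</function_call>", re.DOTALL),
    re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL),
]

def extract_tool_calls(text: str):
    """Return tool calls found in assistant text, in order of appearance."""
    found = []
    for pattern in TAG_PATTERNS:
        for match in pattern.finditer(text):
            try:
                payload = json.loads(match.group(1).strip())
            except json.JSONDecodeError:
                continue  # skip malformed JSON instead of failing
            if isinstance(payload, dict) and "name" in payload:
                call = {"name": payload["name"], "arguments": payload.get("arguments") or {}}
                found.append((match.start(), call))
    return [call for _, call in sorted(found, key=lambda item: item[0])]

sample = 'Let me check.\n<|tool_call|>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</|tool_call|>'
print(extract_tool_calls(sample))
```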

### Execution Responsibility (Recommended)

In passthrough mode, `response.tool_calls` are tool call *requests*. Execute them in your host/runtime (and apply your own safety policy) before sending tool results back to the model in a follow-up turn.
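A host-side executor can be sketched with plain dictionaries. The example assumes tool calls shaped like `{"name", "arguments", "call_id"}`, as in the snippets in this guide; the result dictionary layout, `ALLOWED_TOOLS`, and `execute_tool_calls` are hypothetical and only illustrate an allowlist policy with per-call error capture:

```python
def get_weather(city: str) -> str:
    """Example tool implementation."""
    return f"Weather in {city}: sunny"

# Allowlist: only tools registered here may run
ALLOWED_TOOLS = {"get_weather": get_weather}

def execute_tool_calls(tool_calls, tools):
    """Execute requested tool calls and collect one result per call."""
    results = []
    for call in tool_calls or []:
        name = call.get("name")
        base = {"call_id": call.get("call_id"), "name": name}
        func = tools.get(name)
        if func is None:
            results.append({**base, "success": False, "error": "tool not allowed"})
            continue
        try:
            output = func(**(call.get("arguments") or {}))
            results.append({**base, "success": True, "output": output})
        except Exception as exc:  # report the failure instead of crashing the loop
            results.append({**base, "success": False, "error": str(exc)})
    return results

requested = [
    {"name": "get_weather", "arguments": {"city": "Paris"}, "call_id": "c1"},
    {"name": "delete_file", "arguments": {"path": "x"}, "call_id": "c2"},
    {"name": "get_weather", "arguments": {"town": "Paris"}, "call_id": "c3"},
]
results = execute_tool_calls(requested, ALLOWED_TOOLS)
for r in results:
    print(r)
```

The collected results are what the host would send back to the model in the follow-up turn.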

## Advanced Patterns

### Tool Chaining

Tools can call other tools or return data that triggers additional tool calls:

```python
@tool
def get_user_location(user_id: str) -> str:
    """Get the location of a user."""
    # Simulated implementation
    locations = {"user123": "Paris", "user456": "Tokyo"}
    return locations.get(user_id, "Unknown")

@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: 72°F, sunny"

# LLM can chain these tools:
response = llm.generate(
    "What's the weather like for user123?",
    tools=[get_user_location, get_weather]
)
# In an agent loop, your host/runtime can execute tool calls and feed tool results back into the model for multi-step chaining.
```

### Conditional Tool Execution (Recommended)

In passthrough mode, your host/runtime decides which tool calls to execute:

```python
from abstractcore.tools import ToolCall, ToolRegistry

dangerous_tools = {"delete_file", "system_command", "send_email"}

registry = ToolRegistry()
registry.register(get_user_location)
registry.register(get_weather)

response = llm.generate("What's the weather like for user123?", tools=[get_user_location, get_weather])

for call in response.tool_calls or []:
    name = call.get("name")
    if name in dangerous_tools:
        continue
    result = registry.execute_tool(
        ToolCall(
            name=name,
            arguments=call.get("arguments") or {},
            call_id=call.get("call_id") or call.get("id"),
        )
    )
    print(result)
```

### Async Tool Support

For tools that need to perform async operations:

```python
import asyncio

@tool
def fetch_data(url: str) -> str:
    """Fetch data from a URL."""
    async def async_fetch():
        # Simulate async HTTP request
        await asyncio.sleep(0.1)
        return f"Data from {url}"
    
    # Run the coroutine from sync code; asyncio.run() creates and closes its own event loop.
    # Note: this raises RuntimeError if the calling thread already has a running loop.
    return asyncio.run(async_fetch())
```

## Error Handling

### Tool-Level Error Handling

Handle errors within tools:

```python
@tool
def safe_division(a: float, b: float) -> str:
    """Safely divide two numbers."""
    try:
        if b == 0:
            return "Error: Division by zero is not allowed"
        result = a / b
        return f"{a} ÷ {b} = {result}"
    except Exception as e:
        return f"Error: {str(e)}"
```

### System-Level Error Handling

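Failures during tool execution can be caught as `ToolExecutionError`. Note that in the default passthrough mode, `generate()` only returns tool call requests and does not execute them: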

```python
from abstractcore.exceptions import ToolExecutionError

try:
    response = llm.generate("Use the broken tool", tools=[broken_tool])
except ToolExecutionError as e:
    print(f"Tool execution failed: {e}")
    print(f"Failed tool: {e.tool_name}")
    print(f"Error details: {e.error_details}")
```

### Validation and Sanitization

Validate tool inputs:

```python
@tool
def create_file(filename: str, content: str) -> str:
    """Create a file with the given content."""
    import os
    import re
    
    # Validate filename
    if not re.match(r'^[a-zA-Z0-9_.-]+$', filename):
        return "Error: Invalid filename. Use only letters, numbers, dots, dashes, and underscores."
    
    # Prevent directory traversal
    if '..' in filename or filename.startswith('/'):
        return "Error: Invalid filename. No directory traversal allowed."
    
    try:
        with open(filename, 'w') as f:
            f.write(content)
        return f"File '{filename}' created successfully"
    except Exception as e:
        return f"Error creating file: {str(e)}"
```
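String checks such as `'..' in filename` work for simple names, but resolving the path and comparing it against an allowed base directory also covers absolute paths and symlinks. A minimal sketch using `pathlib` (Python 3.9+ for `is_relative_to`); `resolve_inside` and the temporary `sandbox` directory are hypothetical and only for demonstration:

```python
import tempfile
from pathlib import Path

def resolve_inside(base_dir, filename):
    """Return the resolved path for `filename`, or raise ValueError if it escapes `base_dir`."""
    base = Path(base_dir).resolve()
    candidate = (base / filename).resolve()  # resolves "..", absolute parts, symlinks
    # Reject the base directory itself and anything outside it
    if candidate == base or not candidate.is_relative_to(base):
        raise ValueError(f"{filename!r} is outside the allowed directory")
    return candidate

sandbox = tempfile.mkdtemp()  # demo directory standing in for a workspace
safe_path = resolve_inside(sandbox, "notes.txt")
print(safe_path.name)
```

A tool like `create_file` could call this helper first and return an error string when `ValueError` is raised.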

## Performance Optimization

### Tool Registry

Use the tool registry for better performance with many tools:

```python
from abstractcore.tools import ToolRegistry, register_tool

# Register tools globally
register_tool(get_weather)
register_tool(calculate)
register_tool(list_files)

# Use registered tools
registry = ToolRegistry.get_instance()
available_tools = registry.get_all_tools()

response = llm.generate(
    "Help me with weather and calculations",
    tools=available_tools
)
```

### Lazy Loading

Load expensive resources only when needed:

```python
class DatabaseTool:
    def __init__(self):
        self._connection = None
    
    @property
    def connection(self):
        if self._connection is None:
            # Expensive database connection
            self._connection = create_database_connection()
        return self._connection

db_tool = DatabaseTool()

@tool
def query_database(sql: str) -> str:
    """Execute a SQL query."""
    try:
        result = db_tool.connection.execute(sql)
        return str(result)
    except Exception as e:
        return f"Database error: {str(e)}"
```

### Caching Results

Cache expensive tool results:

```python
from functools import lru_cache

@tool
@lru_cache(maxsize=100)
def expensive_calculation(input_data: str) -> str:
    """Perform an expensive calculation with caching."""
    import time
    time.sleep(1)  # Simulate expensive operation
    return f"Result for {input_data}"
```

## Tool Syntax Rewriting

AbstractCore can rewrite tool-call syntax for downstream agents/clients:

- **Python API**: pass `tool_call_tags=...` to `generate()` / `agenerate()` / `BasicSession.generate()` to preserve and rewrite tool-call markup in `content`.
- **HTTP server**: set the `agent_format` request field (or rely on auto-detection based on `User-Agent` + model name).

See: [Tool Call Syntax Rewriting](tool-syntax-rewriting.md)

## Event System Integration

Observe tool calling and (optional) tool execution through events:

### Cost Monitoring

```python
from abstractcore.events import EventType, on_global

def monitor_tool_costs(event):
    """Monitor costs of tool executions."""
    for call in event.data.get("tool_calls", []) or []:
        if call.get("name") in {"expensive_api_call", "premium_service"}:
            print(f"Warning: Using expensive tool {call.get('name')}")

on_global(EventType.TOOL_STARTED, monitor_tool_costs)
```

### Performance Tracking

```python
def track_tool_performance(event):
    """Track tool execution outcomes (shape varies by emitter)."""
    for result in event.data.get("tool_results", []) or []:
        if result.get("success") is False:
            print(f"Tool failed: {result.get('name')} error={result.get('error')}")

on_global(EventType.TOOL_COMPLETED, track_tool_performance)
```

### Security Auditing

```python
def audit_tool_usage(event):
    """Audit all tool usage for security."""
    for call in event.data.get("tool_calls", []) or []:
        print(f"Tool requested: {call.get('name')} args={call.get('arguments')}")
        # Log to security audit system
        security_log(call.get("name"), call.get("arguments"))

on_global(EventType.TOOL_STARTED, audit_tool_usage)
```

## Best Practices

### 1. Clear Documentation

Always provide clear docstrings for your tools:

```python
@tool
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email to the specified recipient.
    
    Args:
        to: Email address of the recipient
        subject: Subject line of the email
        body: Main content of the email
    
    Returns:
        Success message or error description
    
    Note:
        This tool requires email configuration to be set up.
        Use with caution as it sends actual emails.
    """
    # Implementation here
```

### 2. Input Validation

Always validate and sanitize inputs:

```python
@tool
def process_user_input(user_data: str) -> str:
    """Process user input safely."""
    # Validate input length
    if len(user_data) > 1000:
        return "Error: Input too long (max 1000 characters)"
    
    # Sanitize input
    import html
    safe_data = html.escape(user_data)
    
    # Process safely
    return f"Processed: {safe_data}"
```

### 3. Error Recovery

Provide meaningful error messages and recovery suggestions:

```python
@tool
def connect_to_service(endpoint: str) -> str:
    """Connect to an external service."""
    try:
        # Attempt connection
        result = make_connection(endpoint)
        return f"Connected successfully: {result}"
    except ConnectionError:
        return "Error: Could not connect to service. Please check the endpoint URL and try again."
    except TimeoutError:
        return "Error: Connection timed out. The service may be temporarily unavailable."
    except Exception as e:
        return f"Error: Unexpected error occurred: {str(e)}"
```

### 4. Resource Management

Clean up resources properly:

```python
@tool
def process_large_file(filename: str) -> str:
    """Process a large file efficiently."""
    try:
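        with open(filename, 'r') as file:
            # Read fixed-size chunks so the whole file is never held in memory
            chunks_processed = 0
            while True:
                chunk = file.read(1024)
                if not chunk:
                    break
                process_chunk(chunk)  # placeholder: supply your own per-chunk handler
                chunks_processed += 1
        return f"Processed file: {filename} ({chunks_processed} chunks)"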
    except FileNotFoundError:
        return f"Error: File '{filename}' not found"
    except MemoryError:
        return "Error: File too large to process"
```

## Troubleshooting

### Common Issues

1. **Tool not being called**: Check tool description and parameter names
2. **Invalid JSON in tool calls**: Ensure proper error handling in tools
3. **Tools timing out**: Implement proper timeout handling
4. **Memory issues with large tools**: Use streaming or chunking
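
For item 3, one way to bound a slow tool is to run it in a worker thread and stop waiting after a deadline. The following is a minimal sketch using only the standard library; `run_with_timeout` and `slow_lookup` are hypothetical names, not AbstractCore APIs, and the worker thread is abandoned rather than cancelled.

```python
import concurrent.futures
import time


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread; return an error string if it overruns."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return f"Error: tool timed out after {timeout_s}s"
    finally:
        # Do not block on a still-running worker; it finishes in the background.
        executor.shutdown(wait=False)


def slow_lookup(city: str) -> str:
    """Hypothetical slow tool used only for this illustration."""
    time.sleep(0.5)
    return f"Weather for {city}"


print(run_with_timeout(slow_lookup, 0.05, "Paris"))  # deadline too short
print(run_with_timeout(slow_lookup, 2.0, "Paris"))   # completes in time
```

Returning an error string (instead of raising) matches the error-recovery style used by the tools earlier in this guide, so the model can see what went wrong.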

### Debug Mode

Enable debug mode to see tool execution details:

```python
import logging
logging.basicConfig(level=logging.DEBUG)

# Tool execution details will be logged
response = llm.generate("Use tools", tools=[debug_tool])
```

### Testing Tools

Test tools independently:

```python
# Test tool directly
result = get_weather("Paris")
print(f"Direct call result: {result}")

# Test with LLM
response = llm.generate("What's the weather in Paris?", tools=[get_weather])
print(f"LLM result: {response.content}")
```

## Examples

See the [examples directory](../examples/) for comprehensive tool usage examples:

- [Basic Tool Usage](../examples/tools/tool_usage_basic.py)
- [Advanced Tool Patterns](../examples/tools/tool_usage_advanced.py)
- [Tool Chaining Examples](../examples/learning_path/03_tool_calling.py)

## Related Documentation

- [API Reference](api-reference.md) - Complete API documentation
- [Event System](api-reference.md#event-system) - Event-driven tool control
- [Architecture](architecture.md) - System design and tool execution flow
- [Server Guide](server.md) - HTTP server and REST API
- [Getting Started](getting-started.md) - Quick start guide

---

### Inlined: `docs/tool-syntax-rewriting.md`

# Tool Call Syntax Rewriting

AbstractCore can **convert tool-call syntax** to help different runtimes/clients consume tool calls consistently.

There are two related but distinct features:

1. **Python API (`tool_call_tags`)**: preserve and rewrite *tool-call markup inside assistant content* (mostly for prompted-tool models).
2. **HTTP Server (`agent_format`)**: convert/synthesize tool-call syntax for HTTP clients (Codex, other agentic CLIs), while keeping `tool_calls` structured.

## 1) Python API: `tool_call_tags` (per-call)

`tool_call_tags` is passed to `generate()` / `agenerate()` / `BasicSession.generate()` as a **per-call kwarg**.

### Default behavior (recommended)

- When `tool_call_tags is None` (default):
  - `response.tool_calls` is populated when tool calls are detected (native tools or prompted tags).
  - Tool-call markup is stripped from `response.content` for clean UX/history.

### When to set `tool_call_tags`

Set `tool_call_tags` when you want **tool-call markup kept in `content`** so a downstream consumer can parse it from text.

This is most useful for **prompted-tool** providers (tool calls are emitted in assistant content), e.g.:
- `ollama`
- `lmstudio`
- `mlx`
- `huggingface`
- `openai-compatible` (and compatible endpoints like vLLM / LM Studio)

For **native tool** providers (OpenAI/Anthropic), tool calls are primarily consumed from `response.tool_calls` (structured), not from tags embedded in `content`.
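
As an illustration of what a downstream consumer might do with markup kept in `content`, here is a minimal sketch that pulls JSON payloads out of `<function_call>` tags (the `llama3` format). The helper name `extract_tool_calls` and the `{"name", "arguments"}` payload shape are assumptions for this example, not part of AbstractCore.

```python
import json
import re

# Assumed markup: the `llama3` tag pair described in this guide.
FUNCTION_CALL_RE = re.compile(r"<function_call>(.*?)</function_call>", re.DOTALL)


def extract_tool_calls(content: str) -> list:
    """Return the JSON payloads found between <function_call> tags."""
    calls = []
    for match in FUNCTION_CALL_RE.finditer(content):
        try:
            calls.append(json.loads(match.group(1).strip()))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole parse.
            continue
    return calls


text = (
    "Let me check.\n"
    '<function_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</function_call>'
)
print(extract_tool_calls(text))
```

When `response.tool_calls` is available, prefer it over text parsing; this kind of parser is only needed by consumers that see nothing but the text.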

### Supported values

- Predefined formats:
  - `qwen3` → `<|tool_call|>...JSON...</|tool_call|>`
  - `llama3` → `<function_call>...JSON...</function_call>`
  - `xml` → `<tool_call>...JSON...</tool_call>`
  - `gemma` → `` ```tool_code\n...JSON...\n``` ``
- Custom tags:
  - Comma-separated start/end: `"START,END"` or `"[TOOL],[/TOOL]"`
  - Single tag name: `"MYTAG"` → `<MYTAG>...JSON...</MYTAG>`
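
The mapping above can be sketched as a small lookup. This is an illustration of the documented rules only; `resolve_tags` and `PREDEFINED_TAGS` are hypothetical names, and AbstractCore's internal resolution may differ.

```python
FENCE = "`" * 3  # three backticks, built at runtime to keep this example fence-safe

# Start/end pairs copied from the predefined formats listed above.
PREDEFINED_TAGS = {
    "qwen3": ("<|tool_call|>", "</|tool_call|>"),
    "llama3": ("<function_call>", "</function_call>"),
    "xml": ("<tool_call>", "</tool_call>"),
    "gemma": (FENCE + "tool_code\n", "\n" + FENCE),
}


def resolve_tags(value: str) -> tuple:
    """Map a tool_call_tags value to a (start, end) pair."""
    if value in PREDEFINED_TAGS:
        return PREDEFINED_TAGS[value]
    if "," in value:
        # Comma-separated custom pair, e.g. "[TOOL],[/TOOL]".
        start, end = value.split(",", 1)
        return start.strip(), end.strip()
    # Single tag name, e.g. "MYTAG" -> <MYTAG>...</MYTAG>.
    return f"<{value}>", f"</{value}>"


for name in ("llama3", "[TOOL],[/TOOL]", "MYTAG"):
    print(name, "->", resolve_tags(name))
```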

### Example (non-streaming)

```python
from abstractcore import create_llm

tool = {
    "name": "get_weather",
    "description": "Get weather for a city",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
}

llm = create_llm("ollama", model="qwen3:4b-instruct")
response = llm.generate(
    "Weather in Paris?",
    tools=[tool],
    tool_call_tags="llama3",
)

print(response.content)     # contains <function_call>...</function_call>
print(response.tool_calls)  # always structured dicts for host/runtime execution
```

### Example (streaming)

```python
tool_calls = []
for chunk in llm.generate(
    "Weather in Paris?",
    tools=[tool],
    stream=True,
    tool_call_tags="llama3",
):
    print(chunk.content, end="", flush=True)
    if chunk.tool_calls:
        tool_calls.extend(chunk.tool_calls)
```

## 2) HTTP Server: `agent_format`

When using the AbstractCore server (`/v1/chat/completions`), you can request a target tool-call syntax via `agent_format`.

- `agent_format` affects how tool calls are represented in the response for a given client.
- The server always runs in passthrough mode (`execute_tools=False`): it returns tool calls; it does not execute them.

### Supported values

- `auto` (default): auto-detect based on `User-Agent` + model name patterns
- `openai`
- `codex`
- `qwen3`
- `llama3`
- `xml`
- `gemma`
- `passthrough`

### Example

```bash
curl -sS http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/qwen3:4b-instruct",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}
      }
    }],
    "agent_format": "codex"
  }'
```

## Notes

- `tool_call_tags` is **formatting**, not execution: it only changes how tool calls are represented in `content`.
- The canonical machine-readable representation remains `GenerateResponse.tool_calls` (Python) or `message.tool_calls` (server/OpenAI format).
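
To make the second note concrete, here is a minimal sketch of reading `message.tool_calls` from an OpenAI-format chat completion body. The `read_tool_calls` helper and the sample body are hand-written for illustration (not captured server output); the field layout follows the standard OpenAI chat completion format.

```python
import json


def read_tool_calls(response_json: dict) -> list:
    """Collect (name, arguments) pairs from an OpenAI-style chat completion body."""
    calls = []
    for choice in response_json.get("choices", []):
        message = choice.get("message", {})
        for call in message.get("tool_calls") or []:
            function = call.get("function", {})
            raw_args = function.get("arguments") or "{}"
            # In the OpenAI format, arguments arrive as a JSON-encoded string.
            args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
            calls.append((function.get("name"), args))
    return calls


# Hand-written sample body for illustration.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
            }],
        }
    }]
}
print(read_tool_calls(sample))
```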

---

### Inlined: `docs/structured-output.md`

# Structured Output

AbstractCore generates structured output from Pydantic models, with automatic schema validation and provider-specific optimizations. A dual-strategy architecture adapts to each provider's capabilities: native schema enforcement where the provider supports it, and prompted generation with client-side validation and retries elsewhere.

---

## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Provider Implementation](#provider-implementation)
4. [Usage Guide](#usage-guide)
5. [Schema Design](#schema-design)
6. [Performance Characteristics](#performance-characteristics)
7. [Error Handling](#error-handling)
8. [Production Deployment](#production-deployment)
9. [API Reference](#api-reference)

---

## Overview

### What is Structured Output?

Structured output constrains LLM responses to conform to predefined schemas, enabling direct deserialization into typed objects. AbstractCore uses Pydantic BaseModel classes to define schemas and validate responses.

### Basic Example

```python
from abstractcore import create_llm
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
    email: str

llm = create_llm("openai", model="gpt-4o-mini")
person = llm.generate(
    "Extract: John Doe, 35 years old, john@example.com",
    response_model=Person
)

# person is a validated Person instance
assert isinstance(person, Person)
assert person.name == "John Doe"
assert person.age == 35
```

### Key Benefits

- **Type Safety**: Responses are validated Pydantic instances with full IDE support
- **Schema Compliance**: Automatic validation ensures data conforms to defined structure
- **Provider Agnostic**: Identical API across OpenAI, Anthropic, Ollama, LMStudio, HuggingFace, MLX
- **Automatic Strategy Selection**: Framework selects optimal implementation based on provider capabilities
- **Test Coverage**: Supported strategies are exercised by the repository test suite (see `tests/structured/`)

---

## Architecture

### Dual-Strategy Design

AbstractCore implements two distinct strategies for structured output generation:

#### Strategy 1: Native Structured Output (Server-Side Enforcement)

**Mechanism**: Provider API accepts JSON schema and enforces compliance before returning response.

**Providers**:
- OpenAI (via `response_format` parameter)
- Anthropic (via tool-calling mechanism)
- Ollama (via `format` parameter)
- LMStudio (via `response_format` parameter)
- HuggingFace GGUF models (via `response_format` parameter with llama-cpp-python)

**Characteristics**:
- Schema enforcement happens in the provider API or inference backend
- Few or no client-side validation retries
- Schema compliance is enforced rather than prompted (still provider/model dependent)
- Usually the better choice for production workloads

**Validation**:
- Structured output behavior is covered by automated tests in this repo (see `tests/structured/`).
- Exact success rates and latency depend on provider/model/schema complexity.

#### Strategy 2: Prompted with Validation (Client-Side Enforcement)

**Mechanism**: Schema embedded in system prompt; response extracted, validated, and retried if necessary.

**Providers**:
- HuggingFace (Transformers models), when Outlines is unavailable
- MLX, when Outlines is unavailable
- Any provider without native support

**Characteristics**:
- Schema injected into enhanced prompt
- Client-side Pydantic validation
- Automatic retry with error feedback (up to 3 attempts)
- Fallback for providers without native support
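
The prompted strategy can be sketched as a short loop. This is an illustrative approximation, not AbstractCore's actual handler: `generate_with_validation` is a hypothetical helper, `call_model` stands in for any function that takes a prompt and returns text, and the prompt wording is an assumption.

```python
import json

from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int


def generate_with_validation(call_model, prompt, response_model, max_attempts=3):
    """Prompt with the schema, validate the reply, and retry with error feedback."""
    schema = json.dumps(response_model.model_json_schema())
    current_prompt = f"{prompt}\n\nReply with JSON only, matching this schema:\n{schema}"
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            return response_model.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            # Feed the validation errors back so the next attempt can self-correct.
            current_prompt += f"\n\nYour previous response had validation errors:\n{exc}"
    raise last_error


# Stand-in model: the first reply omits `age`, the second is valid.
canned = iter(['{"name": "Ada"}', '{"name": "Ada", "age": 36}'])
person = generate_with_validation(lambda _prompt: next(canned), "Extract: Ada, 36", Person)
print(person)
```

The key design point is that only validation failures trigger another attempt, and each retry carries the previous errors so the model has something concrete to fix.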

### Automatic Strategy Selection

The `StructuredOutputHandler` selects the appropriate strategy automatically:

```python
def _has_native_support(self, provider) -> bool:
    """Detect native structured output capability"""
    provider_name = provider.__class__.__name__

    # Ollama and LMStudio always have native support
    if provider_name in ['OllamaProvider', 'LMStudioProvider']:
        return True

    # HuggingFace GGUF models (via llama-cpp-python)
    if provider_name == 'HuggingFaceProvider':
        if hasattr(provider, 'model_type') and provider.model_type == 'gguf':
            return True

    # Check model capabilities for other providers
    capabilities = getattr(provider, 'model_capabilities', {})
    return capabilities.get("structured_output") == "native"
```

No configuration required—the framework handles strategy selection transparently.

---

## Provider Implementation

### OpenAI

**Implementation**: Native support via `response_format` parameter

```python
# AbstractCore implementation (simplified)
payload["response_format"] = {
    "type": "json_schema",
    "json_schema": {
        "name": response_model.__name__,
        "schema": response_model.model_json_schema()
    }
}
```

**Models with Native Support**:
- gpt-4o, gpt-4o-mini
- gpt-4-turbo
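- gpt-3.5-turbo

**Note**: OpenAI's Structured Outputs guide states that `json_schema` enforcement starts with the GPT-4o family and that older models such as gpt-4-turbo may use JSON mode instead, so confirm enforcement on the exact model you target.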

**Reference**: [OpenAI Structured Outputs Documentation](https://platform.openai.com/docs/guides/structured-outputs)

---

### Anthropic

**Implementation**: Native support via tool-calling mechanism

The provider forces execution of a tool whose input schema matches the desired output structure.
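
A rough sketch of the forced-tool technique follows. It builds a request body in the shape of Anthropic's Messages API (`tools` with `input_schema`, plus `tool_choice` naming the tool) and reads the structured result from the `tool_use` content block. The helper names, the tool description, and the `max_tokens` value are assumptions for illustration; the exact payload AbstractCore sends is not shown in this guide.

```python
def build_forced_tool_request(model: str, prompt: str, tool_name: str, schema: dict) -> dict:
    """Build a Messages API body that forces a call to a single schema-shaped tool."""
    return {
        "model": model,
        "max_tokens": 1024,  # illustrative value
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "name": tool_name,
            "description": "Return the extracted data as structured JSON.",
            "input_schema": schema,
        }],
        # Forcing this tool makes the model answer with arguments matching the schema.
        "tool_choice": {"type": "tool", "name": tool_name},
    }


def read_tool_input(response_json: dict, tool_name: str):
    """Return the `input` of the matching tool_use block, or None if absent."""
    for block in response_json.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block.get("input")
    return None


person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
request = build_forced_tool_request("claude-haiku-4-5", "Extract: John Doe, 35", "person", person_schema)

# Hand-written sample response for illustration.
sample_response = {"content": [{"type": "tool_use", "name": "person", "input": {"name": "John Doe", "age": 35}}]}
print(read_tool_input(sample_response, "person"))
```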

**Models with Native Support**:
- claude-haiku-4-5
- claude-sonnet-4-5
- claude-opus-4-5

**Reference**: [Anthropic API Documentation](https://docs.anthropic.com/)

---

### Ollama

**Implementation**: Native support via `format` parameter

```python
# AbstractCore implementation (abstractcore/providers/ollama_provider.py:147-152)
if response_model and PYDANTIC_AVAILABLE:
    json_schema = response_model.model_json_schema()
    payload["format"] = json_schema  # Full schema, server-side validation
```

**Mechanism**:
1. Full JSON schema passed to Ollama API
2. Server-side constrained sampling enforces schema compliance
3. Response is expected to follow the schema (provider/model dependent)

**Notes**:
- Native structured output depends on the Ollama server/build and the selected model.
- For example coverage, see `tests/structured/`.

**Supported Models**: Many models, including:
- Llama 3.1, 3.2, 3.3 family
- Qwen 2.5, 3, 3-coder family
- Gemma 2b, 7b, gemma2, gemma3
- Mistral, Phi-3, Phi-4, GLM-4, DeepSeek-R1

**Reference**: [Ollama API Documentation](https://github.com/ollama/ollama/blob/main/docs/api.md)

---

### LMStudio

**Implementation**: Native support via OpenAI-compatible `response_format` parameter

```python
# AbstractCore implementation (abstractcore/providers/lmstudio_provider.py:211-222)
if response_model and PYDANTIC_AVAILABLE:
    json_schema = response_model.model_json_schema()
    payload["response_format"] = {
        "type": "json_schema",
        "json_schema": {
            "name": response_model.__name__,
            "schema": json_schema
        }
    }
```

**Mechanism**:
1. OpenAI-compatible format passed to LMStudio server
2. Server-side schema enforcement via underlying inference engine
3. Response is expected to follow the schema (server/model dependent)

**Notes**:
- Behavior depends on the LMStudio server version and underlying model/runtime.
- For example coverage, see `tests/structured/`.

**Reference**: [LMStudio Documentation](https://lmstudio.ai/docs)

---

### HuggingFace

**Implementation**: Dual strategy based on model type

#### GGUF Models (Native Support)

**Backend**: llama-cpp-python with native structured output

```python
# AbstractCore implementation (abstractcore/providers/huggingface_provider.py:669-680)
if response_model and PYDANTIC_AVAILABLE:
    json_schema = response_model.model_json_schema()
    generation_kwargs["response_format"] = {
        "type": "json_schema",
        "json_schema": {
            "name": response_model.__name__,
            "schema": json_schema
        }
    }
```

**Notes**:
- GGUF structured output support depends on the llama-cpp-python backend and model.
- For example coverage, see `tests/structured/`.

#### Transformers Models (Native via Outlines)

**Backend**: Hugging Face Transformers library with Outlines

**Implementation**: Native support via Outlines constrained generation

```python
# AbstractCore implementation (abstractcore/providers/huggingface_provider.py:514-548)
if response_model and PYDANTIC_AVAILABLE and OUTLINES_AVAILABLE:
    # Cache Outlines model wrapper
    if not hasattr(self, '_outlines_model'):
        self._outlines_model = outlines.from_transformers(
            self.model_instance,
            self.tokenizer
        )

    # Generate with constrained decoding
    generator = self._outlines_model(
        input_text,
        outlines.json_schema(response_model),
        max_tokens=max_tokens
    )

    # Return validated instance
    validated_obj = response_model.model_validate(generator)
```

**Mechanism**:
1. Outlines wraps transformers model and tokenizer
2. JSON schema passed to constrained generator
3. Logit filtering during decoding (in the local process, not on a remote server) ensures only valid tokens are sampled
4. Schema compliance is enforced via constrained decoding (provider/model dependent)
5. Automatic fallback to prompted approach if Outlines unavailable
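
Step 3 of the mechanism above can be illustrated with a toy example. Real constrained decoders such as Outlines derive the set of allowed tokens from the schema at every step over the tokenizer's full vocabulary; the sketch below only shows the masking idea, with a hand-picked `allowed` set and a hypothetical `constrained_argmax` helper.

```python
import math


def constrained_argmax(logits: dict, allowed: set) -> str:
    """Pick the highest-scoring token among those the schema currently allows."""
    # Disallowed tokens get -inf so they can never win.
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token available")
    return best


# Toy vocabulary: unconstrained decoding would pick "hello".
logits = {"hello": 3.2, "{": 1.1, '"': 0.4, "}": -0.5}

# A JSON object must start with "{", so only that token is allowed at step 0.
print(constrained_argmax(logits, allowed={"{"}))
```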

**Installation**:
```bash
pip install "abstractcore[huggingface]"  # Includes Outlines automatically
```

**Characteristics**:
- Schema compliance via constrained decoding (still validated client-side)
- Zero or minimal validation retries when supported
- Works with many transformers-compatible models
- Automatic detection and activation when Outlines is installed
- Graceful fallback to prompted approach if Outlines is missing

**Fallback behavior**:
- If Outlines isn't available (or a backend doesn't support constrained decoding), AbstractCore falls back to prompted structured output with validation and retries.
- Exact success rates and latency depend on provider/model/hardware/schema complexity.

---

### MLX (Apple Silicon)

**Implementation**: Native via Outlines

**Backend**: MLX with Outlines constrained generation

```python
# AbstractCore implementation (abstractcore/providers/mlx_provider.py:165-197)
if response_model and PYDANTIC_AVAILABLE and OUTLINES_AVAILABLE:
    # Cache Outlines MLX model wrapper
    if not hasattr(self, '_outlines_model'):
        self._outlines_model = outlines_models.mlxlm(self.model)

    # Generate with constrained decoding
    generator = self._outlines_model(
        full_prompt,
        outlines.json_schema(response_model),
        max_tokens=max_tokens
    )

    # Return validated instance
    validated_obj = response_model.model_validate(generator)
```

**Mechanism**:
1. Outlines MLX backend wraps mlx-lm model
2. JSON schema converted to token constraints
3. Constrained sampling on Apple Silicon hardware
4. Schema enforcement happens during decoding, in the local process
5. Automatic fallback to prompted approach if Outlines unavailable

**Installation**:
```bash
pip install "abstractcore[mlx]"  # Includes Outlines automatically
```

**Models**:
- mlx-community/Qwen2.5-Coder-7B-Instruct-4bit
- mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
- All MLX-compatible models

**Characteristics**:
- Schema compliance via constrained decoding (still validated client-side)
- Zero or minimal validation retries when supported
- Optimized for Apple M-series processors
- Automatic detection and activation when Outlines installed
- Graceful fallback to prompted approach if Outlines missing

**Performance notes**:
- Prompted structured output (validation + retry) is the default fallback and is often the simplest to run.
- Constrained decoding can be slower or faster depending on backend/model/schema; benchmark on your hardware if it matters.

---

## Usage Guide

### Basic Usage

```python
from abstractcore import create_llm
from pydantic import BaseModel

class ExtractedData(BaseModel):
    name: str
    age: int
    email: str

llm = create_llm("ollama", model="qwen3:4b")
result = llm.generate(
    "Extract: Alice Johnson, 28, alice@example.com",
    response_model=ExtractedData,
    temperature=0  # Recommended for deterministic output
)

print(f"{result.name} ({result.age}): {result.email}")
```

### Tools + Structured Output (2-pass hybrid)

If you pass both `tools=[...]` and `response_model=...` in the same `generate()` call, AbstractCore uses a 2-pass hybrid flow:
1) a tool-capable call, then
2) a structured-output call using the tool-context.

Notes:
- Streaming is not supported in this hybrid mode.
- Tool execution is pass-through by default (`execute_tools=False`): your host/runtime should execute tool calls and feed results back as messages when building multi-step workflows.
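
As a sketch of the host side of pass-through execution, the loop below runs structured tool calls against a local registry and produces messages to feed back to the model. The `run_tool_calls` helper, the registry, and the exact message shape (`role`/`name`/`content`) are assumptions for illustration; adapt them to the message format your provider expects.

```python
def run_tool_calls(tool_calls: list, registry: dict) -> list:
    """Execute structured tool calls and return tool-result messages."""
    messages = []
    for call in tool_calls:
        name = call.get("name")
        func = registry.get(name)
        if func is None:
            content = f"Error: unknown tool '{name}'"
        else:
            try:
                content = str(func(**(call.get("arguments") or {})))
            except Exception as exc:
                # Report failures back to the model instead of crashing the host.
                content = f"Error: {exc}"
        messages.append({"role": "tool", "name": name, "content": content})
    return messages


def get_weather(city: str) -> str:
    """Hypothetical tool used only for this illustration."""
    return f"Sunny in {city}"


calls = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
tool_messages = run_tool_calls(calls, {"get_weather": get_weather})
print(tool_messages)
```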

### Using Enums

Enums provide type-safe categorical values:

```python
from enum import Enum

from abstractcore import create_llm
from pydantic import BaseModel

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class Task(BaseModel):
    title: str
    priority: Priority
    estimated_hours: float

llm = create_llm("lmstudio", model="qwen/qwen3-4b-2507")
task = llm.generate(
    "Create task: Fix authentication bug, critical priority, 8 hours estimated",
    response_model=Task
)

assert isinstance(task.priority, Priority)
print(f"Priority: {task.priority.value}")  # "critical"
```

**Notes**: Enums are supported and exercised by tests; exact behavior depends on provider/model.

### Nested Objects

```python
from abstractcore import create_llm
from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str
    postal_code: str

class Person(BaseModel):
    name: str
    email: str
    address: Address

llm = create_llm("openai", model="gpt-4o-mini")
person = llm.generate(
    """Extract: John Smith, john@example.com
    Address: 123 Main St, Boston, MA 02101""",
    response_model=Person
)

assert isinstance(person.address, Address)
```

### Complex Hierarchies

Complex schemas with multiple nesting levels are supported:

```python
from enum import Enum
from typing import List, Optional

from abstractcore import create_llm
from pydantic import BaseModel

class Department(str, Enum):
    ENGINEERING = "engineering"
    SALES = "sales"
    MARKETING = "marketing"

class EmployeeLevel(str, Enum):
    JUNIOR = "junior"
    MID = "mid"
    SENIOR = "senior"

class Skill(BaseModel):
    name: str
    proficiency: int  # 1-10 scale
    years_experience: float

class Employee(BaseModel):
    name: str
    email: str
    department: Department
    level: EmployeeLevel
    skills: List[Skill]
    manager_email: Optional[str] = None

class Team(BaseModel):
    name: str
    department: Department
    lead: Employee
    members: List[Employee]

class Organization(BaseModel):
    company_name: str
    founded_year: int
    teams: List[Team]
    total_employees: int

llm = create_llm("anthropic", model="claude-haiku-4-5")
org = llm.generate(
    """Create organization: TechCorp, founded 2020
    Team: Platform (engineering)
    Lead: Sarah Chen (sarah@tech.com, senior, Python-9/10-5yrs, AWS-8/10-4yrs)
    Member: Bob Lee (bob@tech.com, mid, JavaScript-7/10-3yrs, manager: sarah@tech.com)
    Total employees: 2""",
    response_model=Organization
)
```

**Notes**: Deeply nested schemas are supported; validate against your target provider/model and see `tests/structured/` for examples.

### Direct Handler Usage

For advanced use cases requiring custom retry configuration:

```python
from abstractcore.structured import StructuredOutputHandler, FeedbackRetry

# Configure custom retry strategy
handler = StructuredOutputHandler(
    retry_strategy=FeedbackRetry(max_attempts=5)
)

result = handler.generate_structured(
    provider=llm,
    prompt="Extract complex data from document...",
    response_model=ComplexSchema,
    temperature=0
)
```

---

## Schema Design

### Design Principles

Well-designed schemas improve validation success rates and reduce response times.

#### 1. Clear Field Naming

Use descriptive, unambiguous field names:

```python
# Recommended
class Employee(BaseModel):
    employee_id: str
    hire_date: str
    department: str
    annual_salary: float

# Avoid
class Employee(BaseModel):
    id: str  # Ambiguous
    date: str  # What date?
    dept: str  # Abbreviation unclear
    salary: float  # Currency? Period?
```

#### 2. Leverage Enums for Categorical Data

Enums provide validation and type safety:

```python
# Recommended
class Status(str, Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    PENDING = "pending"

class User(BaseModel):
    status: Status  # Only valid enum values accepted

# Avoid
class User(BaseModel):
    status: str  # Any string accepted, no validation
```

#### 3. Use Optional Fields Appropriately

Distinguish required from optional fields:

```python
from typing import Optional, List

class Task(BaseModel):
    # Required fields
    title: str
    created_at: str

    # Optional with defaults
    description: str = ""
    tags: List[str] = []

    # Truly optional (may be None)
    assigned_to: Optional[str] = None
    due_date: Optional[str] = None
```

#### 4. Logical Hierarchy

Group related fields into nested objects:

```python
# Recommended
class ContactInfo(BaseModel):
    email: str
    phone: str
    address: str

class Person(BaseModel):
    name: str
    contact: ContactInfo  # Logical grouping

# Avoid flat structure
class Person(BaseModel):
    name: str
    email: str
    phone: str
    address: str
```

### Complexity Guidelines

Schema complexity affects latency and cost; keep schemas as small as practical.

#### Simple Schemas (< 10 fields, 1 level)

**Example**:
```python
class PersonInfo(BaseModel):
    name: str
    age: int
    email: str
    occupation: str
```

**Recommended for**: User profiles, data extraction, form processing

#### Medium Schemas (10-30 fields, 1-2 levels)

**Example**:
```python
class Project(BaseModel):
    name: str
    description: str
    start_date: str
    tasks: List[Task]  # Nested objects
    total_hours: float
```

**Recommended for**: Project management, task tracking, structured data extraction

#### Complex Schemas (30+ fields, 3+ levels)

**Example**:
```python
class Organization(BaseModel):
    company_name: str
    teams: List[Team]  # Level 2
    # Team contains:
    #   lead: Employee  # Level 3
    #   members: List[Employee]  # Level 3
    #     # Employee contains:
    #     #   skills: List[Skill]  # Level 4
```

**Recommended for**: Organizational hierarchies, knowledge graphs, complex data models

### Anti-Patterns

Avoid these patterns that can degrade performance or reliability:

#### 1. Excessive Nesting Depth (>4 levels)

```python
# Avoid
class Level1(BaseModel):
    level2: Level2
    # Level2 -> Level3 -> Level4 -> Level5 (too deep)
```

**Impact**: Increased token usage, longer response times

#### 2. Ambiguous Enum Values

```python
# Avoid
class Status(str, Enum):
    ONE = "1"
    TWO = "2"
    THREE = "3"

# Recommended
class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
```

#### 3. Overly Long Field Names

```python
# Avoid
class Data(BaseModel):
    very_long_and_descriptive_field_name_that_uses_many_tokens: str

# Recommended
class Data(BaseModel):
    user_email: str  # Clear but concise
```

**Impact**: Increases token count, affecting cost and context window

---

## Performance Characteristics

Structured output performance is highly dependent on:
- Provider/backend strategy (native constrained decoding vs prompted validation/retry)
- Schema complexity (field count + nesting depth)
- Model choice, server configuration, and hardware
- Sampling settings (use `temperature=0` when you care about schema fidelity)

If performance matters, benchmark on your target provider/model/hardware.
Historical benchmark notes (non-authoritative) may exist under `docs/reports/`.

### Temperature Settings

**Recommendation**: Use `temperature=0` for structured outputs

**Rationale**:
- Near-deterministic responses (some backends still vary slightly between runs)
- More consistent schema compliance
- Simpler sampling (greedy decoding)

**When to increase temperature**:
- Creative content generation within schema constraints
- Diverse response generation for the same prompt
- Exploratory data generation

---

## Error Handling

### Error Categories

#### 1. Infrastructure Errors (Retriable)

Network failures, timeouts, server unavailability—retry with exponential backoff:

```python
import time
from requests.exceptions import ConnectionError, Timeout

def generate_with_retry(llm, prompt, response_model, max_retries=3):
    """Retry infrastructure errors with exponential backoff"""
    for attempt in range(max_retries):
        try:
            return llm.generate(
                prompt,
                response_model=response_model,
                temperature=0
            )
        except (ConnectionError, Timeout):
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the original error
            time.sleep(2 ** attempt)  # Back off 1s, then 2s (no wait after the final attempt)

result = generate_with_retry(llm, "Extract data...", DataModel)
```

**Retriable errors**:
- `ConnectionError`: Network connectivity issues
- `TimeoutError`: Request timeout
- HTTP 5xx: Server errors
- Token limit exceeded: retriable only after simplifying the schema or chunking the input (see Token Limit Errors below)

#### 2. Validation Errors (Non-Retriable)

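A `ValidationError` that reaches your code has already exhausted the automatic feedback retries, so it usually indicates a schema or prompt issue. Fix the cause instead of retrying the same request: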

```python
from pydantic import ValidationError

try:
    result = llm.generate(
        "Extract user data...",
        response_model=UserModel
    )
except ValidationError as e:
    # Log validation errors
    print("Schema validation failed:")
    for error in e.errors():
        field = " -> ".join(str(loc) for loc in error['loc'])
        print(f"  {field}: {error['msg']}")

    # Fix schema or prompt—do not retry
    raise
```

**Common validation errors**:
- Missing required fields: Schema too strict or prompt unclear
- Type mismatches: Field type incompatible with LLM output
- Enum validation failures: LLM returned invalid enum value

**Resolution**: Revise schema or improve prompt clarity

#### 3. Token Limit Errors

Context window exceeded—simplify schema or split request:

```python
try:
    result = llm.generate(prompt, response_model=ComplexModel)
except Exception as e:
    if "token" in str(e).lower() or "context" in str(e).lower():
        print("Token limit exceeded. Options:")
        print("1. Simplify schema (reduce fields or nesting)")
        print("2. Split into multiple requests")
        print("3. Use model with larger context window")
    raise  # Always re-raise so unrelated errors are not silently swallowed
```

### Retry Strategy Details

The default `FeedbackRetry` strategy:

1. **Maximum attempts**: 3 (configurable)
2. **Retry condition**: Only `ValidationError` exceptions
3. **Feedback mechanism**: Provides detailed error descriptions to LLM

**Example error feedback**:
```
Your previous response had validation errors:
• Missing required field: 'department'
• Field 'employee_level': Expected one of: junior, mid, senior
• Field 'age': Expected integer, received string
```

The LLM uses this feedback to self-correct on subsequent attempts.
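
A feedback message of this kind could be assembled from Pydantic's `ValidationError.errors()` output, which is a list of dicts with `loc` and `msg` keys. The sketch below is an assumption about how such formatting might look, not the actual `FeedbackRetry` implementation; `format_feedback` is a hypothetical helper and the sample errors are hand-written.

```python
def format_feedback(errors: list) -> str:
    """Turn error dicts (in the shape returned by ValidationError.errors()) into feedback text."""
    lines = ["Your previous response had validation errors:"]
    for error in errors:
        # `loc` is a tuple path such as ("skills", 0, "proficiency").
        field = ".".join(str(part) for part in error.get("loc", ())) or "(root)"
        lines.append(f"• Field '{field}': {error.get('msg', 'invalid value')}")
    return "\n".join(lines)


# Hand-written sample errors for illustration.
sample_errors = [
    {"loc": ("department",), "msg": "Field required"},
    {"loc": ("skills", 0, "proficiency"), "msg": "Input should be a valid integer"},
]
print(format_feedback(sample_errors))
```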

**Configuration**:
```python
from abstractcore.structured import StructuredOutputHandler, FeedbackRetry

handler = StructuredOutputHandler(
    retry_strategy=FeedbackRetry(max_attempts=5)
)
```

---

## Production Deployment

### Pre-Deployment Checklist

Before deploying structured outputs to production:

- [ ] Schema validated locally with Pydantic: `Model.model_validate(test_data)`
- [ ] Success rate measured with target model (target: >95%)
- [ ] Response time benchmarked under expected load
- [ ] Error handling implemented for infrastructure failures
- [ ] Logging configured for validation errors and retries
- [ ] Monitoring configured for success rates and latencies
- [ ] Fallback strategy defined for structured output failures
- [ ] Token limits verified: `tokens(prompt) + tokens(schema) + tokens(response) < context_window` (count tokens, not characters)
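
For the token-limit item, a rough pre-flight check can be sketched as follows. The 4-characters-per-token ratio is a coarse heuristic and an assumption here; use your model's tokenizer for real counts. `estimate_tokens` and `fits_context` are hypothetical helpers.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real ratios vary by model and language


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(prompt: str, schema_json: str, max_output_tokens: int, context_window: int) -> bool:
    """Check that prompt + schema + reserved output stay under the context window."""
    used = estimate_tokens(prompt) + estimate_tokens(schema_json) + max_output_tokens
    return used < context_window


prompt = "Extract the person described in the following text. " * 8
schema_json = '{"type": "object", "properties": {"name": {"type": "string"}}}'
print(fits_context(prompt, schema_json, max_output_tokens=256, context_window=4096))
```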

### Monitoring Metrics

Track these metrics in production:

**Success Metrics**:
- Validation success rate (target: >95%)
- First-attempt success rate
- Average retry count

**Performance Metrics**:
- p50, p95, p99 response times
- Response time by schema complexity
- Token usage statistics

**Error Metrics**:
- Validation error rate by field
- Infrastructure error rate
- Token limit exceeded rate
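
Several of these metrics can be derived from simple per-request records. The sketch below assumes a hypothetical record shape (`success`, `attempts`, `latency_s`) that your own logging would need to produce, and uses a nearest-rank percentile for simplicity.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: list) -> dict:
    """Aggregate success, retry, and latency metrics from request records."""
    total = len(records)
    successes = [r for r in records if r["success"]]
    first_try = [r for r in successes if r["attempts"] == 1]
    latencies = [r["latency_s"] for r in records]
    return {
        "success_rate": len(successes) / total,
        "first_attempt_success_rate": len(first_try) / total,
        "avg_retries": sum(r["attempts"] - 1 for r in records) / total,
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
    }


# Hand-written sample records for illustration.
records = [
    {"success": True, "attempts": 1, "latency_s": 0.8},
    {"success": True, "attempts": 2, "latency_s": 1.9},
    {"success": True, "attempts": 1, "latency_s": 0.7},
    {"success": False, "attempts": 3, "latency_s": 4.2},
]
print(summarize(records))
```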

### Configuration Best Practices

**Temperature**: Set to 0 for deterministic structured outputs
```python
llm.generate(prompt, response_model=Model, temperature=0)
```

**Timeout**: Configure appropriate timeouts based on schema complexity
```python
# Simple schemas: 30s
# Medium schemas: 60s
# Complex schemas: 120s
```

**Provider Selection**:
- Development: Use local providers (Ollama, LMStudio) for cost efficiency
- Production: Select based on performance requirements and budget

### Schema Versioning

Maintain schema version compatibility:

```python
from pydantic import BaseModel, Field

class UserV1(BaseModel):
    name: str
    email: str

class UserV2(BaseModel):
    name: str
    email: str
    department: str = Field(default="unassigned")  # Backward compatible
```

Use optional fields with defaults for backward-compatible schema evolution.

---

## API Reference

### Core Function

```python
llm.generate(
    prompt: str,
    response_model: Type[BaseModel],
    temperature: float = 0.0,
    **kwargs
) -> BaseModel
```

**Parameters**:
- `prompt` (str): Input prompt describing extraction/generation task
- `response_model` (Type[BaseModel]): Pydantic model class defining output schema
- `temperature` (float): Sampling temperature (0.0 = deterministic, 1.0 = creative)
- `**kwargs`: Additional provider-specific parameters

**Returns**:
- Instance of `response_model`, validated and type-safe

**Raises**:
- `ValidationError`: Schema validation failed after all retry attempts
- `ConnectionError`: Network/infrastructure error
- `TimeoutError`: Request timeout

**Example**:
```python
person = llm.generate(
    "Extract: John Doe, age 35",
    response_model=Person,
    temperature=0
)
```

### StructuredOutputHandler

Advanced handler for custom retry strategies:

```python
from abstractcore.structured import StructuredOutputHandler

handler = StructuredOutputHandler(retry_strategy=None)
```

**Methods**:

```python
handler.generate_structured(
    provider: LLMProvider,
    prompt: str,
    response_model: Type[BaseModel],
    **kwargs
) -> BaseModel
```

Generates structured output with automatic strategy selection (native or prompted).

### Retry Strategies

```python
from abstractcore.structured import FeedbackRetry

retry = FeedbackRetry(max_attempts=3)
```

**Parameters**:
- `max_attempts` (int): Maximum retry attempts including initial attempt

**Methods**:
- `should_retry(attempt, error)`: Returns True if retry should occur
- `prepare_retry_prompt(prompt, error, attempt)`: Creates retry prompt with validation feedback

---

## Related Documentation

- [Getting Started](getting-started.md#structured-output) - Quick introduction
- [API Reference](api-reference.md) - Complete API documentation
- [Examples](examples.md#structured-output-examples) - Real-world usage patterns
- [Response Model Parameter Analysis](archive/structured-response-keyword.md) - Why `response_model`
- [Native Implementation Test Results](archive/improved-structured-response.md) - Detailed test data

---

## References

- [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs)
- [Anthropic API Documentation](https://docs.anthropic.com/)
- [Ollama API Documentation](https://github.com/ollama/ollama/blob/main/docs/api.md)
- [Pydantic Documentation](https://docs.pydantic.dev/)

---

## Testing and Validation

Structured output behavior is exercised by automated tests under `tests/structured/`.

### Running tests

From this repository:

```bash
pip install -e ".[test]"
pytest tests/structured -q
```

Some provider-specific tests require additional extras:

- HuggingFace / Outlines: `pip install -e ".[huggingface]"`
- MLX: `pip install -e ".[mlx]"` (macOS + Apple Silicon only)

If you're installing from PyPI and just want the test dependencies:

```bash
pip install "abstractcore[test]"
pytest -q
```

### Notes

- Performance and success rates vary widely by provider/model/schema complexity and are not guaranteed.
- If performance matters, benchmark on your target hardware/provider setup.

---

### Inlined: `docs/media-handling-system.md`

# Media Handling System

AbstractCore provides a **production-ready unified media handling system** for attaching and processing files across LLM providers and models. Images, documents, and other media files go through the same simple API, with provider-specific formatting and graceful fallback handling.

## Key Benefits

- **Universal API**: Same `media=[]` parameter works across all providers (OpenAI, Anthropic, Ollama, LMStudio, etc.)
- **Intelligent Processing**: Automatic file type detection with specialized processors for each format
- **Provider Adaptation**: Automatic formatting for each provider's API requirements (for example, `image_url` content parts for OpenAI and base64 `image` blocks for Anthropic)
- **Robust Fallback**: Graceful degradation when advanced processing fails, with best-effort results or a clear error
- **CLI Integration**: Simple `@filename` syntax in CLI for instant file attachment
- **Production Quality**: Comprehensive error handling, logging, and performance optimization
- **Cross-Format Support**: Images, PDFs, Office documents, CSV/TSV, text files all work seamlessly

## Quick Start

```python
from abstractcore import create_llm

# Works with any provider - just change the provider name
llm = create_llm("openai", model="gpt-4o", api_key="your-key")
response = llm.generate(
    "What's in this image and document?",
    media=["photo.jpg", "report.pdf"]
)
print(response.content)

# Same code works with Anthropic
llm = create_llm("anthropic", model="claude-3.5-sonnet", api_key="your-key")
response = llm.generate(
    "Analyze these materials",
    media=["chart.png", "data.csv", "presentation.pptx"]
)

# Or with local models
llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate(
    "Describe this image",
    media=["screenshot.png"]
)
```

## How It Works Behind the Scenes

AbstractCore's media system uses a multi-layer pipeline that processes each supported file type and formats the result for the target LLM provider:

### 1. File Attachment Processing

**CLI Integration (`@filename` syntax):**
```python
# User types: "Analyze this @report.pdf and @chart.png"
# MessagePreprocessor extracts files and cleans text:
clean_text = "Analyze this  and"  # File references removed
media_files = ["report.pdf", "chart.png"]  # Extracted file paths
```
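
The extraction step can be approximated with a small regular expression. This is a sketch, not the actual `MessagePreprocessor`: `extract_attachments` is a hypothetical helper, it assumes file references are whitespace-delimited tokens starting with `@`, and unlike the illustration above it also collapses the leftover whitespace.

```python
import re
from typing import List, Tuple

# "@" must start a token, so addresses such as "a@b.com" are left alone.
ATTACHMENT = re.compile(r"(?<!\S)@(\S+)")


def extract_attachments(text: str) -> Tuple[str, List[str]]:
    """Return (text without @file tokens, list of referenced file paths)."""
    files = ATTACHMENT.findall(text)
    cleaned = " ".join(ATTACHMENT.sub("", text).split())
    return cleaned, files


clean_text, media_files = extract_attachments("Analyze this @report.pdf and @chart.png")
print(clean_text)
print(media_files)
```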

**Python API:**
```python
# Direct media parameter usage
llm.generate("Analyze these files", media=["report.pdf", "chart.png"])
```

### 2. Intelligent File Processing Pipeline

**AutoMediaHandler Coordination:**
```python
# 1. Detect file types automatically (media type -> processor)
#    MediaType.IMAGE     -> ImageProcessor (PIL-based)
#    MediaType.DOCUMENT  -> PDFProcessor (pypdf) or OfficeProcessor (Unstructured)
#    MediaType.TEXT      -> TextProcessor (pandas for CSV/TSV)

# 2. Process each file with specialized processor
pdf_content = PDFProcessor.process("report.pdf")      # → text/Markdown from permissive PDF extraction
image_content = ImageProcessor.process("chart.png")   # → Base64 + metadata
```
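
As a rough illustration of the dispatch step, the sketch below maps file extensions to processor names. The table and `pick_processor` are hypothetical stand-ins; the real detection logic may inspect MIME types or file contents rather than extensions alone.

```python
from pathlib import Path

# Hypothetical extension table built from the formats named in this guide.
PROCESSOR_BY_SUFFIX = {
    ".png": "ImageProcessor", ".jpg": "ImageProcessor", ".jpeg": "ImageProcessor",
    ".gif": "ImageProcessor", ".webp": "ImageProcessor",
    ".pdf": "PDFProcessor",
    ".docx": "OfficeProcessor", ".xlsx": "OfficeProcessor", ".pptx": "OfficeProcessor",
    ".txt": "TextProcessor", ".md": "TextProcessor", ".csv": "TextProcessor",
    ".tsv": "TextProcessor", ".json": "TextProcessor",
}


def pick_processor(path: str) -> str:
    """Return the processor name for a file, or raise for unknown types."""
    suffix = Path(path).suffix.lower()
    if suffix not in PROCESSOR_BY_SUFFIX:
        raise ValueError(f"Unsupported file type: {path!r}")
    return PROCESSOR_BY_SUFFIX[suffix]


for name in ["Report.PDF", "chart.png", "data.csv", "slides.pptx"]:
    print(name, "->", pick_processor(name))
```

Raising on unknown types, rather than guessing, matches the "no silent semantic changes" principle stated elsewhere in this guide.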

**Graceful Fallback System:**
```python
try:
    # Specialized processing (pypdf, Unstructured)
    content = advanced_processor.process(file)
except Exception:
    # Fall back to basic processing
    content = basic_text_extraction(file)  # Best-effort; may still surface a clear error
```

### 3. Provider-Specific Formatting

**The same processed content gets formatted differently for each provider:**

**OpenAI Format (JSON):**
```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Analyze these files"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0..."}},
    {"type": "text", "text": "PDF Content: # Report Title\n\nExecutive Summary..."}
  ]
}
```

**Anthropic Format (Messages API):**
```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Analyze these files"},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "iVBORw0..."}},
    {"type": "text", "text": "PDF Content: # Report Title\n\nExecutive Summary..."}
  ]
}
```

**Local Models (Text Injected into the Prompt):**
```python
# For local models without native multimodal support
combined_prompt = """
Analyze these files:

Image Analysis: [A business chart showing quarterly revenue trends...]
PDF Content: # Report Title

Executive Summary...
"""
```
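
The two JSON shapes shown above can be produced from the same image bytes. The helpers below are a hedged sketch: `to_openai_content` and `to_anthropic_content` are hypothetical names, not AbstractCore's handlers, and they cover only the text-plus-one-image case.

```python
import base64
from typing import Dict, List


def to_openai_content(text: str, image_bytes: bytes, mime: str = "image/png") -> List[Dict]:
    """Build OpenAI-style content parts: the image travels as a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
    ]


def to_anthropic_content(text: str, image_bytes: bytes, mime: str = "image/png") -> List[Dict]:
    """Build Anthropic-style content blocks: the image is a base64 source."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": text},
        {"type": "image", "source": {"type": "base64", "media_type": mime, "data": b64}},
    ]


png_header = b"\x89PNG\r\n\x1a\n"  # PNG signature bytes, used here as placeholder image data
openai_msg = {"role": "user", "content": to_openai_content("Analyze these files", png_header)}
anthropic_msg = {"role": "user", "content": to_anthropic_content("Analyze these files", png_header)}
print(openai_msg["content"][1]["image_url"]["url"][:40])
```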

### 4. Cross-Provider Workflow

```mermaid
graph TD
    A[User Input with @files] --> B[MessagePreprocessor]
    B --> C[Extract Files + Clean Text]
    C --> D[AutoMediaHandler]
    D --> E{File Type?}
    E -->|Image| F[ImageProcessor]
    E -->|PDF| G[PDFProcessor]
    E -->|Office| H[OfficeProcessor]
    E -->|Text| I[TextProcessor]
    F --> J[MediaContent Objects]
    G --> J
    H --> J
    I --> J
    J --> K{Provider Type?}
    K -->|OpenAI| L[OpenAIMediaHandler]
    K -->|Anthropic| M[AnthropicMediaHandler]
    K -->|Local| N[LocalMediaHandler]
    L --> O[Provider-Specific API Format]
    M --> O
    N --> O
    O --> P[LLM API Call]
    P --> Q[Response to User]
```

### 5. Error Handling & Resilience

**Multi-Level Fallback Strategy:**
1. **Specialized Processing**: Use configured specialized libraries (pypdf, Unstructured)
2. **Basic Processing**: Fall back to simple text extraction
3. **Metadata Only**: If all else fails, provide file metadata
4. **Graceful Degradation**: Best-effort results with clear errors (no silent semantic changes)

**Example of Robust Error Handling:**
```python
try:
    # Try default PDF processing with pypdf
    content = pdf_processor.extract_with_formatting(file)
except PDFProcessingError:
    try:
        # Fall back to basic text extraction
        content = pdf_processor.extract_basic_text(file)
    except Exception:
        # Ultimate fallback - provide metadata
        content = f"PDF file: {file.name} ({file.size} bytes)"

# Result: Callers get a best-effort output or a clear error message (no silent truncation).
```
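
The same multi-level idea can be written as a generic chain that records which level produced the result. This is a sketch with hypothetical names (`extract_with_fallback` and the fake extractors), not the library's internal code.

```python
from typing import Callable, List, Tuple

Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: List[Tuple[str, Extractor]]
) -> Tuple[str, str, List[str]]:
    """Try extractors in order; return (level name, content, errors seen so far)."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            return name, extractor(path), errors
        except Exception as exc:  # keep the error so the caller can see why a level was skipped
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"All extractors failed for {path}: " + "; ".join(errors))


def formatted_extraction(path: str) -> str:
    raise ValueError("layout parser unavailable")  # simulate a failing specialized step


def basic_extraction(path: str) -> str:
    return f"plain text of {path}"


level, content, errors = extract_with_fallback(
    "report.pdf",
    [("formatted", formatted_extraction), ("basic", basic_extraction)],
)
print(level, content, errors)
```

Returning the skipped-level errors alongside the content keeps the degradation visible to the caller instead of hiding it.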

## Supported File Types

### Images (Vision Models)
- **Formats**: PNG, JPEG, GIF, WEBP, BMP, TIFF
- **Automatic**: Optimization, resizing, format conversion
- **Features**: EXIF handling, quality optimization for vision models

### Documents
- **Text Files**: TXT, MD, CSV, TSV, JSON with intelligent parsing and data analysis
- **PDF**: Text and metadata extraction with pypdf by default. Optional PyMuPDF4LLM layout extraction requires the explicit `abstractcore[pdf-pymupdf-commercial]` extra after license review.
- **Office**: DOCX, XLSX, PPTX via Unstructured (when installed), with best-effort extraction
  - **Word**: section/paragraph extraction
  - **Excel**: sheet-by-sheet extraction
  - **PowerPoint**: slide-by-slide extraction

### Audio (policy-driven; optional STT fallback)
- **Formats**: common `audio/*` types (WAV, MP3, M4A, …) as attachments via `media=[...]`
- **Default behavior**: `audio_policy="auto"` uses native audio when the model supports it, otherwise the configured `input.voice` route, and otherwise raises an error
- **Speech-to-text**: `audio_policy="speech_to_text"` runs STT via the capability plugin layer (`llm.audio.transcribe(...)`) and injects a transcript into the main request; normal defaults should route this through `input.voice`
- **Reserved**: `audio_policy="caption"` is not configured in v0 (must error; non-speech audio analysis needs an explicit capability)

Transparency:
- When STT fallback is used, `GenerateResponse.metadata.media_enrichment[]` records what was injected and which backend was used.

Requirements:
- **Native audio** requires an audio-capable model.
- **STT fallback** requires installing an STT capability plugin (typically `pip install abstractvoice`) and configuring `input.voice`; `abstractcore --set-audio-strategy auto` only chooses the policy.
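
The policy rules above reduce to a small decision function. The sketch below is an assumption-laden illustration: `resolve_audio_strategy` is a hypothetical helper, and it covers only the three policy values described in this section.

```python
def resolve_audio_strategy(policy: str, native_audio: bool, voice_route: bool) -> str:
    """Return "native" or "speech_to_text", or raise when the policy cannot be satisfied."""
    if policy == "auto":
        if native_audio:
            return "native"
        if voice_route:
            return "speech_to_text"
        raise ValueError("auto: no native audio support and no input.voice route configured")
    if policy == "speech_to_text":
        if voice_route:
            return "speech_to_text"
        raise ValueError("speech_to_text: no input.voice route configured")
    if policy == "caption":
        raise NotImplementedError("caption: reserved, not configured in v0")
    raise ValueError(f"Unknown audio_policy: {policy!r}")


print(resolve_audio_strategy("auto", native_audio=True, voice_route=False))
print(resolve_audio_strategy("auto", native_audio=False, voice_route=True))
```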

### Video (policy-driven; native or frames fallback)
- **Formats**: common `video/*` types as attachments via `media=[...]`
- **Default behavior**: `video_policy="auto"` (native video when supported; otherwise sample frames and route through visual support on `input.text` or an explicit `input.video` route)
- **Budgets**: frame count and downscale are explicit and logged (see `abstractcore/providers/base.py`)

Requirements:
- Frame sampling fallback requires **`ffmpeg`/`ffprobe`** available on `PATH`.
- For the sampled-frame path, you also need visual handling: either a vision-capable main model, an `input.text` route that supports frames/images, or an explicit `input.video` route; for local frame attachments, install `abstractcore[media]` so Pillow-based image processing is available.
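
Before relying on the frame-sampling fallback, it can help to confirm that both tools are discoverable. The check below uses only the standard library; `missing_video_tools` is a hypothetical helper name.

```python
import shutil
from typing import List, Sequence


def missing_video_tools(tools: Sequence[str] = ("ffmpeg", "ffprobe")) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_video_tools()
if missing:
    print("Frame sampling unavailable; missing on PATH:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe found")
```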

### Processing Features
- **Intelligent Detection**: Automatic file type recognition and processor selection
- **Content Optimization**: Format-specific processing optimized for LLM consumption
- **Robust Fallback**: Graceful degradation gives best-effort results or a clear error
- **Performance Optimized**: Lazy loading and efficient memory usage
- **Testing status**: Coverage varies by provider and modality; see the test suite under `tests/media_handling/`

### Token Estimation & No Truncation Policy

AbstractCore processors **do not silently truncate content**. This design decision ensures:

1. **No data loss**: Full file content is always preserved
2. **User control**: Callers decide how to handle large files (summarize, chunk, error)
3. **Model flexibility**: Works correctly across models with different context limits (8K to 200K+)

**Token estimation** is automatically added to `MediaContent.metadata`:
```python
result = processor.process_file("data.csv")
print(result.media_content.metadata['estimated_tokens'])  # e.g., 1500
print(result.media_content.metadata['content_length'])    # e.g., 6000 chars
```

**Handlers use this for validation**:
```python
handler = OpenAIMediaHandler()
tokens = handler.estimate_tokens_for_media(media_content)
# Uses metadata['estimated_tokens'] if available, falls back to heuristic
```

For large files that exceed model context limits, use `BasicSummarizer` or implement custom chunking at the application layer.
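
Application-layer chunking can be as simple as splitting on paragraph boundaries under an estimated token budget. The sketch below assumes roughly 4 characters per token (a heuristic, not AbstractCore's estimator), and `chunk_by_token_budget` is a hypothetical helper.

```python
from typing import List

CHARS_PER_TOKEN = 4  # rough heuristic (assumption)


def chunk_by_token_budget(text: str, max_tokens: int) -> List[str]:
    """Split text on blank lines so each chunk stays within the estimated budget."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    max_chars = max_tokens * CHARS_PER_TOKEN
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Hard-split any paragraph that is larger than the budget on its own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 30
chunks = chunk_by_token_budget(sample, max_tokens=5)
print([len(c) for c in chunks])
```

Only paragraph separators at chunk boundaries are dropped; every other character of the input appears in exactly one chunk.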

## Provider Compatibility

### Vision-Enabled Providers

| Provider | Vision Models | Image Support | Document Support |
|----------|---------------|---------------|------------------|
| **OpenAI** | GPT-4o, GPT-4 Turbo with Vision | Supported: Multi-image | Supported: All formats |
| **Anthropic** | Claude 3.5 Sonnet, Claude 4 series | Supported: Up to 20 images | Supported: All formats |
| **Ollama** | qwen2.5vl:7b, gemma3:4b, llama3.2-vision:11b | Supported: Single image | Supported: All formats |
| **LMStudio** | qwen2.5-vl-7b, gemma-3n-e4b, magistral-small-2509 | Supported: Multiple images | Supported: All formats |

### Text-Only Providers

All providers support document processing even without vision capabilities:

| Provider | Document Processing | Text Extraction |
|----------|-------------------|-----------------|
| **HuggingFace** | Supported: All formats | Supported: Embedded in prompt |
| **MLX** | Supported: All formats | Supported: Embedded in prompt |
| **Any Provider** | Supported: Automatic fallback | Supported: Text extraction |

### ⚠️ Model Compatibility Notes (Updated: 2025-10-17)

Some newer vision models may not be immediately available due to rapid development:

**LMStudio Limitations:**
- `qwen3-vl` models (8B, 30B) - Not yet supported in LMStudio
- Use `qwen2.5-vl-7b` as a proven alternative

**HuggingFace Limitations:**
- `Qwen3-VL` models - Require newer transformers architecture
- Install latest transformers: `pip install --upgrade transformers`
- Or use bleeding edge: `pip install git+https://github.com/huggingface/transformers.git`

**Recommended Stable Models (2025-10-17):**
- **LMStudio**: `qwen/qwen2.5-vl-7b`, `google/gemma-3n-e4b`, `mistralai/magistral-small-2509`
- **Ollama**: `qwen2.5vl:7b`, `gemma3:4b`, `llama3.2-vision:11b`
- **OpenAI**: `gpt-4o`, `gpt-4-turbo-with-vision`
- **Anthropic**: `claude-3.5-sonnet`, `claude-4-series`

## Usage Examples

### Vision Analysis

```python
from abstractcore import create_llm

# Analyze images with any vision model
llm = create_llm("openai", model="gpt-4o")

# Single image analysis
response = llm.generate(
    "What's happening in this image?",
    media=["photo.jpg"]
)

# Multiple images comparison
response = llm.generate(
    "Compare these two charts and explain the trends",
    media=["chart1.png", "chart2.png"]
)

# Mixed media analysis
response = llm.generate(
    "Summarize the report and relate it to what you see in the image",
    media=["financial_report.pdf", "stock_chart.png"]
)
```

### Document Processing

```python
# PDF analysis
response = llm.generate(
    "Summarize the key findings from this research paper",
    media=["research_paper.pdf"]
)

# Office document processing
response = llm.generate(
    "Create a summary of this presentation and spreadsheet",
    media=["quarterly_results.pptx", "financial_data.xlsx"]
)

# CSV data analysis
response = llm.generate(
    "What patterns do you see in this sales data?",
    media=["sales_data.csv"]
)
```

### CLI Usage

These examples work in AbstractCore CLI when `abstractcore[media]` is installed and your selected provider/model supports the requested media (or you configured fallbacks):

```bash
# PDF Analysis
python -m abstractcore.utils.cli --prompt "What is this document about? @report.pdf"

# Office Documents
python -m abstractcore.utils.cli --prompt "Summarize this presentation @slides.pptx"
python -m abstractcore.utils.cli --prompt "What data is in @spreadsheet.xlsx"
python -m abstractcore.utils.cli --prompt "Analyze this document @contract.docx"

# Data Files
python -m abstractcore.utils.cli --prompt "What patterns are in @sales_data.csv"
python -m abstractcore.utils.cli --prompt "Analyze this data @metrics.tsv"

# Images
python -m abstractcore.utils.cli --prompt "What's in this image? @screenshot.png"

# Mixed Media
python -m abstractcore.utils.cli --prompt "Compare @chart.png and @data.csv and explain trends"
```

### Cross-provider semantics (what’s consistent)

```python
# AbstractCore exposes a single `media=[...]` parameter across providers, but behavior
# depends on provider/model capabilities and your media policies.

# Documents (PDF/Office/text/CSV/TSV/...) are extracted to text/metadata and injected into the request.
# This generally works across providers because the final payload is text.
media_files = ["report.pdf", "data.xlsx"]
prompt = "Analyze these documents and provide insights"

# OpenAI
openai_llm = create_llm("openai", model="gpt-4o")
openai_response = openai_llm.generate(prompt, media=media_files)

# Anthropic
anthropic_llm = create_llm("anthropic", model="claude-haiku-4-5")
anthropic_response = anthropic_llm.generate(prompt, media=media_files)

# Image/audio/video inputs are policy-driven and require native support or explicit fallbacks.
# See: docs/vision-capabilities.md and docs/media-handling-system.md (policies + fallbacks).
```

### Streaming with Media

```python
# Real-time streaming responses with media
llm = create_llm("openai", model="gpt-4o")  # requires: pip install "abstractcore[openai]"

for chunk in llm.generate(
    "Describe this image in detail",
    media=["complex_diagram.png"],
    stream=True
):
    print(chunk.content or "", end="", flush=True)
```

## Advanced Features

### Maximum Resolution Optimization

AbstractCore automatically resizes images to fit each model's maximum supported resolution, so vision models receive as much detail as they can use:

```python
from abstractcore import create_llm

# Images are automatically optimized for each model's maximum resolution
llm = create_llm("openai", model="gpt-4o")
response = llm.generate(
    "Analyze this image in detail",
    media=["photo.jpg"]  # Resized to fit within 4096x4096 for GPT-4o
)

# Different model, different optimization
llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate(
    "What's in this image?",
    media=["photo.jpg"]  # Resized to fit within 3584x3584 for qwen2.5vl
)
```

**Model-Specific Resolution Limits:**
- **GPT-4o**: Up to 4096x4096 pixels
- **Claude 3.5 Sonnet**: Up to 1568x1568 pixels
- **qwen2.5vl:7b**: Up to 3584x3584 pixels
- **gemma3:4b**: Up to 896x896 pixels
- **llama3.2-vision:11b**: Up to 560x560 pixels

**Benefits:**
- **Better Accuracy**: Higher resolution means more detail for the model to analyze
- **Automatic**: No manual configuration required
- **Provider-Aware**: Adapts to each provider's optimal settings
- **Quality Optimization**: Higher JPEG quality (90%) to reduce compression artifacts
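
The arithmetic behind fitting an image to a model limit is straightforward. The sketch below uses the limits listed above and preserves aspect ratio; `target_size` is a hypothetical helper, and the choice never to upscale smaller images is an assumption about the intended behavior.

```python
from typing import Tuple

# Maximum side length per model, taken from the list above.
MAX_DIMENSION = {
    "gpt-4o": 4096,
    "claude-3.5-sonnet": 1568,
    "qwen2.5vl:7b": 3584,
    "gemma3:4b": 896,
    "llama3.2-vision:11b": 560,
}


def target_size(width: int, height: int, model: str) -> Tuple[int, int]:
    """Scale (width, height) down to fit the model's limit, keeping aspect ratio."""
    limit = MAX_DIMENSION[model]
    scale = min(1.0, limit / max(width, height))  # never upscale (assumption)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(target_size(8000, 4000, "gpt-4o"))
print(target_size(1120, 560, "llama3.2-vision:11b"))
print(target_size(640, 480, "gemma3:4b"))
```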

### Capability Detection

The system automatically detects model capabilities and adapts accordingly:

```python
from abstractcore import create_llm
from abstractcore.media.capabilities import is_vision_model, supports_images

# Check if a model supports vision
if is_vision_model("gpt-4o"):
    print("This model can process images")

if supports_images("claude-3.5-sonnet"):
    print("This model supports image analysis")

# Text-only model + image input is policy-driven
llm = create_llm("openai", model="gpt-4")  # text-only example
response = llm.generate(
    "Analyze this image",
    media=["photo.jpg"],  # Errors unless vision fallback is configured; see below.
)
```

### Vision fallback (optional; config-driven)

AbstractCore includes an optional **vision fallback** that enables text-only models to process images using a transparent two-stage pipeline (caption → inject short observations).

#### How Vision Fallback Works

When vision fallback is configured and you use a text-only model with images, AbstractCore:

1. **Detects Model Limitations**: Identifies when a text-only model receives an image
2. **Uses Vision Fallback**: Employs a configured vision model to analyze the image
3. **Provides Description**: Passes the image description to the text-only model
4. **Returns Results**: Your text model answers using the injected observations (recorded in `metadata.media_enrichment[]`)

#### Example

Configure a vision captioner once:

```bash
abstractcore --set-vision-provider lmstudio qwen/qwen3-vl-4b
```

Then use any text model with images:

```python
from abstractcore import create_llm

llm = create_llm("lmstudio", model="qwen/qwen3-next-80b")  # text-only
resp = llm.generate("What's in this image?", media=["whale_photo.jpg"])
print(resp.content)
```

#### Behind the Scenes

What actually happens (transparent to user):
1. **Stage 1**: the configured vision model (`qwen/qwen3-vl-4b` in this example) analyzes `whale_photo.jpg` → detailed description
2. **Stage 2**: `qwen/qwen3-next-80b` (text-only) processes description + user question → final analysis
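
The two stages can be sketched with plain callables standing in for the models. This is an illustration only: `answer_with_vision_fallback`, `fake_caption`, and `fake_generate` are hypothetical, and the returned `media_enrichment` list merely mimics the idea of recording what was injected.

```python
from typing import Callable, Dict, List


def answer_with_vision_fallback(
    question: str,
    image_paths: List[str],
    caption: Callable[[str], str],
    generate: Callable[[str], str],
) -> Dict:
    """Stage 1: caption each image. Stage 2: answer using the injected descriptions."""
    observations = [{"path": p, "description": caption(p)} for p in image_paths]
    context = "\n".join(
        f"[Image {i}: {obs['description']}]" for i, obs in enumerate(observations, 1)
    )
    content = generate(f"{context}\n\n{question}")
    return {"content": content, "media_enrichment": observations}


def fake_caption(path: str) -> str:
    return f"a photo stored at {path}"  # stand-in for the vision model


def fake_generate(prompt: str) -> str:
    return f"ANSWER based on: {prompt}"  # stand-in for the text-only model


result = answer_with_vision_fallback(
    "What's in this image?", ["whale_photo.jpg"], fake_caption, fake_generate
)
print(result["content"])
```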

#### Configuration Commands

```bash
# Check current status
abstractcore --status

# Download local caption models (optional)
abstractcore --download-vision-model              # BLIP base (990MB)
abstractcore --download-vision-model vit-gpt2     # ViT-GPT2 (500MB, CPU-friendly)
abstractcore --download-vision-model git-base     # GIT base (400MB, smallest)

# Use an existing vision-capable model as the fallback captioner
abstractcore --set-vision-provider ollama qwen2.5vl:7b
abstractcore --set-vision-provider lmstudio qwen/qwen3-vl-4b
abstractcore --set-vision-provider openai gpt-4o
abstractcore --set-vision-provider anthropic claude-sonnet-4-5

# Interactive setup
abstractcore --config

# Advanced: Fallback chains
abstractcore --add-vision-fallback ollama qwen2.5vl:7b
abstractcore --add-vision-fallback openai gpt-4o
```

#### Benefits of Vision Fallback

- **Universal Compatibility**: Any text-only model can now process images
- **Cost Optimization**: Use cheaper text models for reasoning, vision models only for description
- **Transparent Operation**: Users don't need to change their code
- **Flexible Configuration**: Local models, cloud APIs, or hybrid setups
- **Offline-First**: Works without internet after downloading local models
- **Automatic Fallback**: Images are routed to the configured vision model automatically; when no fallback is configured, the request fails with a clear error

#### Supported Vision Models

**Local Models (Downloaded):**
- **BLIP Base**: 990MB, high quality, CPU/GPU compatible
- **ViT-GPT2**: 500MB, CPU-friendly, good performance
- **GIT Base**: 400MB, smallest size, basic quality

**Provider Models:**
- **Ollama**: `qwen2.5vl:7b`, `llama3.2-vision:11b`, `gemma3:4b`
- **LMStudio**: `qwen/qwen2.5-vl-7b`, `google/gemma-3n-e4b`
- **OpenAI**: `gpt-4o`, `gpt-4-turbo-with-vision`
- **Anthropic**: `claude-3.5-sonnet`, `claude-4-series`

### Custom Processing Options

```python
# Advanced image processing
from abstractcore.media.processors import ImageProcessor

processor = ImageProcessor(
    optimize_for_vision=True,
    max_dimension=1024,
    quality=85
)

# Advanced PDF processing
from abstractcore.media.processors import PDFProcessor

pdf_processor = PDFProcessor(
    extract_images=True,
    markdown_output=True,
    preserve_tables=True
)
```

### Direct Media Processing

```python
# Process files directly (without LLM)
from abstractcore.media import process_file

# Process any supported file
result = process_file("document.pdf")
if result.success:
    print(f"Content: {result.media_content.content}")
    print(f"Type: {result.media_content.media_type}")
    print(f"Metadata: {result.media_content.metadata}")
```

## Recommended Practices

### File Size and Limits

```python
# Check model-specific limits
from abstractcore.media.capabilities import get_media_capabilities

caps = get_media_capabilities("gpt-4o")
print(f"Max images per message: {caps.max_images}")
print(f"Supported formats: {caps.supported_formats}")
```

### Error Handling

```python
try:
    response = llm.generate(
        "Analyze this file",
        media=["large_document.pdf"]
    )
except Exception as e:
    print(f"Media processing error: {e}")
    # Fallback to text-only processing
    response = llm.generate("Analyze the uploaded document content")
```

### Performance Tips

```python
# For large documents, consider chunking
from abstractcore.media.processors import PDFProcessor

processor = PDFProcessor(chunk_size=8000)  # Process in chunks

# For multiple images, process in batches
image_files = ["img1.jpg", "img2.jpg", "img3.jpg"]
for batch in [image_files[i:i+3] for i in range(0, len(image_files), 3)]:
    response = llm.generate("Analyze these images", media=batch)
```

## Model-Specific Examples

### OpenAI GPT-4o

```python
# Multi-image analysis with high detail
llm = create_llm("openai", model="gpt-4o")
response = llm.generate(
    "Compare these architectural photos and identify the styles",
    media=["building1.jpg", "building2.jpg", "building3.jpg"]
)
```

### Anthropic Claude 3.5 Sonnet

```python
# Document analysis with specialized prompts
llm = create_llm("anthropic", model="claude-3.5-sonnet")
response = llm.generate(
    "Provide a comprehensive analysis of this research paper",
    media=["academic_paper.pdf"]
)
```

### Local Vision Models

```python
# Ollama with qwen2.5-vl
ollama_llm = create_llm("ollama", model="qwen2.5vl:7b")
response = ollama_llm.generate(
    "What objects do you see in this image?",
    media=["scene.jpg"]
)

# LMStudio with qwen2.5-vl
lmstudio_llm = create_llm("lmstudio", model="qwen/qwen2.5-vl-7b")
response = lmstudio_llm.generate(
    "Describe this chart and its trends",
    media=["business_chart.png"]
)

# Ollama with Llama 3.2 Vision
llama_llm = create_llm("ollama", model="llama3.2-vision:11b")
response = llama_llm.generate(
    "Analyze this document layout",
    media=["document.jpg"]
)
```

## Installation

### Basic Installation

```bash
# Core media handling (images, text, basic documents)
pip install "abstractcore[media]"
```

### Full Installation

```bash
# Media features (PDF + Office docs) are covered by `abstractcore[media]`.
# Compose only what your app needs:
pip install "abstractcore[remote,media,tools]"

# Or choose a turnkey local-runtime install:
pip install "abstractcore[all-apple]"    # Apple Silicon: HF/GGUF + MLX + features + server
pip install "abstractcore[all-gpu]"      # NVIDIA GPU: HF/GGUF + vLLM + features + server
```

Advanced: If you prefer to install only the pieces you need (instead of `abstractcore[media]`),
these are the main libraries AbstractCore uses:

- `Pillow` (images)
- `pypdf` (permissive PDF text/metadata extraction)
- `unstructured[docx,pptx,xlsx,odt,rtf]` (Office docs)
- `pandas` (tabular helpers)

## Troubleshooting

### Common Issues

**Media not processed:**
```python
# Check if media dependencies are installed
try:
    response = llm.generate("Test", media=["test.jpg"])
except ImportError as e:
    print(f"Missing dependency: {e}")
    print('Install with: pip install "abstractcore[media]"')
```

**Vision model not detecting images:**
```python
# Verify model capabilities
from abstractcore.media.capabilities import is_vision_model

if not is_vision_model("your-model"):
    print("This model doesn't support vision")
    print("Try: gpt-4o, claude-3.5-sonnet, qwen2.5vl:7b, or llama3.2-vision:11b")
```

**Large file processing:**
```python
# For large files, check size limits
import os
file_size = os.path.getsize("large_file.pdf")
if file_size > 10 * 1024 * 1024:  # 10MB
    print("File may be too large for some providers")
```

### Validation

```bash
# Test your installation
python validate_media_system.py

# Run comprehensive tests
python -m pytest tests/media_handling/ -v
```

## API Reference

### Core Functions

```python
# Main generation with media
llm.generate(prompt, media=files, **kwargs)

# Direct file processing
from abstractcore.media import process_file
result = process_file(file_path)

# Capability detection
from abstractcore.media.capabilities import (
    is_vision_model,
    supports_images,
    get_media_capabilities
)
```

### Media Types

```python
from abstractcore.media.types import MediaType, ContentFormat

# MediaType.IMAGE, MediaType.DOCUMENT, MediaType.TEXT
# ContentFormat.BASE64, ContentFormat.TEXT, ContentFormat.BINARY
```

### Processors

```python
from abstractcore.media.processors import (
    ImageProcessor,    # Images with PIL
    TextProcessor,     # Text, CSV, JSON with pandas
    PDFProcessor,      # PDFs with pypdf by default
    OfficeProcessor    # DOCX, XLSX, PPTX with unstructured
)
```

## Next Steps

- **[Getting Started Guide](getting-started.md)** - Complete AbstractCore tutorial
- **[API Reference](api-reference.md)** - Full Python API documentation
- **[Glyph + Vision Example](../examples/media/glyph_complete_example.py)** - End-to-end document analysis with a vision model
- **[Supported Formats Utility](../examples/media/list_supported_formats.py)** - Inspect available processors and supported formats

---

The media handling system makes AbstractCore multimodal while maintaining the same "write once, run everywhere" philosophy. Focus on your application logic while AbstractCore handles the complexity of different provider APIs and media formats.

---

### Inlined: `docs/embeddings.md`

# Vector Embeddings Guide

AbstractCore includes built-in support for vector embeddings with **7 providers**: HuggingFace (local), Ollama, LMStudio, OpenAI, OpenRouter, Portkey, and any OpenAI-compatible endpoint. This guide shows you how to use embeddings for semantic search, RAG applications, and similarity analysis.

**Two ways to use embeddings:**
1. **Python Library** (this guide) - Direct programmatic usage via `EmbeddingManager`
2. **REST API** - HTTP endpoints via AbstractCore server (see [Server API Reference](server.md#embeddings))

## Quick Start

### Installation

```bash
# Install with embeddings support
pip install "abstractcore[embeddings]"
```

### First Embeddings

```python
from abstractcore.embeddings import EmbeddingManager

# Option 1: HuggingFace (default) - Local models with optional ONNX acceleration
embedder = EmbeddingManager()  # Uses all-MiniLM-L6-v2 by default

# Option 2: Ollama - Local models via Ollama API
embedder = EmbeddingManager(
    provider="ollama",
    model="granite-embedding:278m"
)

# Option 3: LMStudio - Local models via LMStudio API
embedder = EmbeddingManager(
    provider="lmstudio",
    model="text-embedding-all-minilm-l6-v2"
)

# Generate embedding for a single text (works with all providers)
embedding = embedder.embed("Machine learning transforms how we process information")
print(f"Embedding dimension: {len(embedding)}")  # 384 for MiniLM

# Compute similarity between texts (works with all providers)
similarity = embedder.compute_similarity(
    "artificial intelligence",
    "machine learning"
)
print(f"Similarity: {similarity:.3f}")  # 0.847
```

## Available Providers & Models

AbstractCore supports multiple embedding providers:

### HuggingFace Provider (Default)

Local sentence-transformers models with optional ONNX acceleration (when available).

| Model | Size | Dimensions | Languages | Primary Use Cases |
|-------|------|------------|-----------|----------|
| **all-minilm** (default) | 90M | 384 | English | Fast local development, testing |
| **qwen3-embedding** | 1.5B | 1536 | 100+ | Qwen-based multilingual, instruction-tuned |
| **embeddinggemma** | 300M | 768 | 100+ | General purpose, multilingual |
| **granite** | 278M | 768 | 100+ | Enterprise applications |

```python
# Default: all-MiniLM-L6-v2 (fast and lightweight)
embedder = EmbeddingManager()

# Qwen-based embedding model for multilingual support
embedder = EmbeddingManager(model="qwen3-embedding")

# Google's EmbeddingGemma for multilingual support
embedder = EmbeddingManager(model="embeddinggemma")

# Direct HuggingFace model ID
embedder = EmbeddingManager(model="sentence-transformers/all-MiniLM-L6-v2")
```

### Ollama Provider

Local embedding models via Ollama API. Requires Ollama running locally.

```python
# Setup: Install Ollama and pull an embedding model
# ollama pull granite-embedding:278m

# Use Ollama embeddings
embedder = EmbeddingManager(
    provider="ollama",
    model="granite-embedding:278m"
)

# Other popular Ollama embedding models:
# - nomic-embed-text (274MB)
# - granite-embedding:107m (smaller, faster)
```

### LMStudio Provider

Local embedding models via LMStudio API. Requires LMStudio running with a loaded model.

```python
# Setup: Start LMStudio and load an embedding model

# Use LMStudio embeddings
embedder = EmbeddingManager(
    provider="lmstudio",
    model="text-embedding-all-minilm-l6-v2"
)
```

For embedding-only use, AbstractCore does not require the configured embedding
model to appear in the server's chat model catalogue. This keeps LM Studio,
vLLM, and generic OpenAI-compatible embeddings endpoints usable when `/models`
reports only loaded chat models or omits embedding models.

### OpenAI Provider

Cloud embedding models via the OpenAI API. Requires `OPENAI_API_KEY`.

```python
# Use OpenAI text-embedding-3-small (1536 dimensions)
embedder = EmbeddingManager(
    provider="openai",
    model="text-embedding-3-small"
)

# Or text-embedding-3-large (3072 dimensions, higher quality)
embedder = EmbeddingManager(
    provider="openai",
    model="text-embedding-3-large"
)
```

### Gateway Providers (OpenRouter, Portkey, OpenAI-compatible)

Route embedding requests through gateway/proxy providers. Useful for cost tracking, rate limiting, or unified billing.

```python
# Via OpenRouter gateway (requires OPENROUTER_API_KEY)
embedder = EmbeddingManager(
    provider="openrouter",
    model="openai/text-embedding-3-small"
)

# Via Portkey AI gateway (requires PORTKEY_API_KEY)
embedder = EmbeddingManager(
    provider="portkey",
    model="text-embedding-3-small"
)

# Via any OpenAI-compatible endpoint
embedder = EmbeddingManager(
    provider="openai-compatible",
    model="my-embedding-model"
)
```

Endpoint-backed embedding providers (`lmstudio`, `vllm`, and
`openai-compatible`) send requests directly to `/v1/embeddings` and skip eager
chat-model validation during embedding client setup. Request-time provider
errors still surface from the embedding call itself.

### Provider Comparison

| Provider | Speed | Setup | Privacy | Cost | Primary Use Cases |
|----------|-------|-------|---------|------|----------|
| **HuggingFace** | Fast | Easy | Full | Free | Development, production |
| **Ollama** | Medium | Medium | Full | Free | Privacy, custom models |
| **LMStudio** | Medium | Easy (GUI) | Full | Free | GUI management, testing |
| **OpenAI** | Fast | API key | Cloud | Paid | High-quality cloud embeddings |
| **OpenRouter** | Fast | API key | Cloud | Paid | Gateway routing, cost tracking |
| **Portkey** | Fast | API key | Cloud | Paid | Gateway routing, unified billing |
| **OpenAI-compatible** | Varies | URL | Varies | Varies | Custom endpoints, self-hosted |

## Core Features

### Single Text Embeddings

```python
embedder = EmbeddingManager()

text = "Python is a versatile programming language"
embedding = embedder.embed(text)

print(f"Text: {text}")
print(f"Embedding: {len(embedding)} dimensions")
print(f"First 5 values: {embedding[:5]}")
```

### Batch Processing (More Efficient)

```python
texts = [
    "Python programming language",
    "JavaScript for web development",
    "Machine learning with Python",
    "Data science and analytics"
]

# Process multiple texts at once (much faster)
embeddings = embedder.embed_batch(texts)

print(f"Generated {len(embeddings)} embeddings")
for i, embedding in enumerate(embeddings):
    print(f"Text {i+1}: {len(embedding)} dimensions")
```

### Similarity Analysis

```python
# Basic similarity between two texts
similarity = embedder.compute_similarity("cat", "kitten")
print(f"Similarity: {similarity:.3f}")  # 0.804

# NEW: Batch similarity - compare one text against many
query = "Python programming"
docs = ["Learn Python basics", "JavaScript guide", "Cooking recipes", "Data science with Python"]
similarities = embedder.compute_similarities(query, docs)
print(f"Batch similarities: {[f'{s:.3f}' for s in similarities]}")
# Output: ['0.785', '0.155', '0.145', '0.580']

# NEW: Similarity matrix - compare all texts against all texts
texts = ["Python programming", "JavaScript development", "Python data science", "Web frameworks"]
matrix = embedder.compute_similarities_matrix(texts)
print(f"Matrix shape: {matrix.shape}")  # (4, 4) symmetric matrix

# NEW: Asymmetric matrix for query-document matching
queries = ["Learn Python", "Web development guide"]
knowledge_base = ["Python tutorial", "JavaScript guide", "React framework", "Python for beginners"]
search_matrix = embedder.compute_similarities_matrix(queries, knowledge_base)
print(f"Search matrix: {search_matrix.shape}")  # (2, 4) - 2 queries × 4 documents
```
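
For intuition about what these scores mean, the sketch below computes cosine similarity by hand on toy vectors. It assumes the scores are cosine-based, which the note on unit-length vectors elsewhere in this guide suggests but does not confirm. `cosine_similarity` is an illustrative helper, not AbstractCore's implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch (assumption)
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings", made up for illustration
cat = [1.0, 2.0, 0.0]
kitten = [2.0, 4.0, 0.0]   # same direction as cat
invoice = [0.0, 0.0, 3.0]  # orthogonal to cat

print(f"{cosine_similarity(cat, kitten):.3f}")   # 1.000
print(f"{cosine_similarity(cat, invoice):.3f}")  # 0.000
```

Vector length does not affect the score; only direction does. This is why pre-normalized embeddings reduce the computation to a plain dot product.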

## Practical Applications

### Semantic Search

```python
from abstractcore.embeddings import EmbeddingManager

embedder = EmbeddingManager()

# Document collection
documents = [
    "Python is strong for data science and machine learning applications",
    "JavaScript enables interactive web pages and modern frontend development",
    "React is a popular library for building user interfaces with JavaScript",
    "SQL databases store and query structured data efficiently",
    "Machine learning algorithms can predict patterns from historical data"
]

def semantic_search(query, documents, top_k=3):
    """Find most relevant documents for a query."""
    # Score every document in one batch call instead of one call per document
    scores = embedder.compute_similarities(query, documents)
    similarities = [
        (i, score, doc) for i, (score, doc) in enumerate(zip(scores, documents))
    ]

    # Sort by similarity (highest first)
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:top_k]

# Search for relevant documents
query = "web development frameworks"
results = semantic_search(query, documents)

print(f"Query: {query}\n")
for rank, (idx, similarity, doc) in enumerate(results, 1):
    print(f"{rank}. Score: {similarity:.3f}")
    print(f"   {doc}\n")
```
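
For a larger or mostly static collection, a common pattern is to embed the documents once and rank stored vectors for each query. The sketch below shows only the ranking step, on toy 2-dimensional vectors. `top_k_indices` is a hypothetical helper; in practice the vectors would come from `embedder.embed_batch(documents)` and `embedder.embed(query)`.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def _unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_indices(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, cosine score) pairs for the k best matches."""
    q = _unit(query_vec)
    scored = (
        (i, sum(a * b for a, b in zip(q, _unit(d))))
        for i, d in enumerate(doc_vecs)
    )
    # nlargest avoids sorting the whole collection when k is small
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for precomputed document embeddings
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]

best = top_k_indices(query_vec, doc_vecs, k=2)
print([i for i, _ in best])  # [0, 2]
```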

### Simple RAG Pipeline

```python
from abstractcore import create_llm
from abstractcore.embeddings import EmbeddingManager

# Setup
embedder = EmbeddingManager()
llm = create_llm("openai", model="gpt-4o-mini")

# Knowledge base
knowledge_base = [
    "The Eiffel Tower is 330 meters tall and was completed in 1889.",
    "Paris is the capital city of France with over 2 million inhabitants.",
    "The Louvre Museum in Paris houses the famous Mona Lisa painting.",
    "French cuisine is known for its wine, cheese, and pastries.",
    "The Seine River flows through central Paris."
]

def rag_query(question, knowledge_base, llm, embedder):
    """Answer question using relevant context from knowledge base."""

    # Step 1: Find most relevant context
    similarities = []
    for doc in knowledge_base:
        similarity = embedder.compute_similarity(question, doc)
        similarities.append((similarity, doc))

    # Get top 2 most relevant documents
    similarities.sort(reverse=True)
    top_contexts = [doc for _, doc in similarities[:2]]
    context = "\n".join(top_contexts)

    # Step 2: Generate answer using context
    prompt = f"""Context:
{context}

Question: {question}

Based on the context above, please answer the question:"""

    response = llm.generate(prompt)
    return response.content, top_contexts

# Usage
question = "How tall is the Eiffel Tower?"
answer, contexts = rag_query(question, knowledge_base, llm, embedder)

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"\nUsed context:")
for ctx in contexts:
    print(f"- {ctx}")
```

### Document Clustering (NEW)

```python
from abstractcore.embeddings import EmbeddingManager

embedder = EmbeddingManager()

# Documents to cluster
documents = [
    "Python programming tutorial for beginners",
    "Introduction to machine learning concepts",
    "JavaScript web development guide",
    "Advanced Python data structures",
    "Machine learning with neural networks",
    "Building web apps with JavaScript",
    "Python for data analysis",
    "Deep learning fundamentals",
    "React.js frontend development",
    "Statistical analysis with Python"
]

# NEW: Automatic semantic clustering
clusters = embedder.find_similar_clusters(
    documents,
    threshold=0.6,      # 60% similarity required
    min_cluster_size=2  # At least 2 documents per cluster
)

print(f"Found {len(clusters)} clusters:")
for i, cluster in enumerate(clusters):
    print(f"\nCluster {i+1} ({len(cluster)} documents):")
    for idx in cluster:
        print(f"  - {documents[idx]}")

# Example output:
# Cluster 1 (4 documents): Python-related content
# Cluster 2 (2 documents): JavaScript-related content
# Cluster 3 (2 documents): Machine learning content
```
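
One plausible way to read `threshold` and `min_cluster_size` is threshold-based grouping over a similarity matrix. The sketch below links any two documents whose similarity meets the threshold and returns the connected groups. This is an assumption for illustration, not a description of how `find_similar_clusters` is implemented. `threshold_clusters` and the toy matrix are hypothetical.

```python
from typing import Dict, List, Sequence


def threshold_clusters(
    sim: Sequence[Sequence[float]],
    threshold: float = 0.6,
    min_cluster_size: int = 2,
) -> List[List[int]]:
    """Group indices connected by pairwise similarity >= threshold."""
    n = len(sim)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    kept = [g for g in groups.values() if len(g) >= min_cluster_size]
    return sorted(kept, key=lambda g: g[0])


# Toy symmetric similarity matrix for 6 documents
pairs = {(0, 1): 0.8, (1, 2): 0.7, (3, 4): 0.9}
n = 6
sim = [
    [1.0 if i == j else pairs.get((min(i, j), max(i, j)), 0.1) for j in range(n)]
    for i in range(n)
]

clusters = threshold_clusters(sim)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Note the chaining effect: documents 0 and 2 land in the same cluster through document 1 even though their direct similarity is only 0.1. A stricter scheme would require every pair in a cluster to meet the threshold.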

## Performance Optimization

### ONNX Backend (optional)

```python
# Enable ONNX for faster inference
embedder = EmbeddingManager(
    model="embeddinggemma",
    backend="onnx"  # optional
)

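# Throughput measurement (run once per backend to compare)
import time

# Distinct texts so cache hits do not skew the timing
texts = [f"Sample text {i} for performance testing" for i in range(100)]

# Time the embedding generation
start_time = time.perf_counter()
embeddings = embedder.embed_batch(texts)
duration = time.perf_counter() - start_time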

print(f"Generated {len(embeddings)} embeddings in {duration:.2f} seconds")
print(f"Speed: {len(embeddings)/duration:.1f} embeddings/second")
```

### Dimension Truncation (Memory/Speed Trade-off)

```python
# Truncate embeddings for faster processing
embedder = EmbeddingManager(
    model="embeddinggemma",
    output_dims=256  # Reduce from 768 to 256 dimensions
)

embedding = embedder.embed("Test text")
print(f"Truncated embedding dimension: {len(embedding)}")  # 256
```
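
Conceptually, dimension truncation keeps the leading components of a vector. The sketch below also re-normalizes the result so cosine scores stay comparable. Whether AbstractCore re-normalizes after truncation is not stated here, so treat that step as an assumption. `truncate_embedding` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def truncate_embedding(
    vec: Sequence[float],
    output_dims: int,
    renormalize: bool = True,
) -> List[float]:
    """Keep the first output_dims components, optionally rescaled to unit length."""
    if not 0 < output_dims <= len(vec):
        raise ValueError("output_dims must be between 1 and len(vec)")
    head = list(vec[:output_dims])
    if renormalize:
        norm = math.sqrt(sum(x * x for x in head))
        if norm > 0:
            head = [x / norm for x in head]
    return head


full = [0.5, 0.5, 0.5, 0.5]  # unit-length toy vector
short = truncate_embedding(full, 2)

print(len(short))                                       # 2
print(f"{math.sqrt(sum(x * x for x in short)):.3f}")    # 1.000
```

Truncation of this kind only preserves quality for models trained for it (Matryoshka-style); for other models the leading components carry no special weight.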

### Advanced Caching (NEW)

```python
# Configure dual-layer caching system
embedder = EmbeddingManager(
    cache_size=5000,  # Larger memory cache
    cache_dir="./embeddings_cache"  # Persistent disk cache
)

# Regular embedding with standard caching
embedding1 = embedder.embed("Machine learning text")

# NEW: Normalized embedding with dedicated cache (unit-length vectors for cosine similarity)
normalized = embedder.embed_normalized("Machine learning text")
print(f"Normalized embedding length: {sum(x*x for x in normalized)**0.5:.3f}")  # 1.0 (unit length)

# Check comprehensive cache stats
stats = embedder.get_cache_stats()
print(f"Regular cache: {stats['persistent_cache_size']} embeddings")
print(f"Normalized cache: {stats['normalized_cache_size']} embeddings")
print(f"Memory cache hits: {stats['memory_cache_info']['hits']}")
```
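
To make the dual-layer idea concrete, here is a minimal sketch of a small in-memory LRU in front of a disk store. It is not AbstractCore's cache: the class name, the one-JSON-file-per-text layout, and the stats keys are all assumptions chosen for illustration.

```python
import hashlib
import json
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Callable, List


class TwoLayerCache:
    """Memory LRU backed by one JSON file per text (illustrative only)."""

    def __init__(self, compute: Callable[[str], List[float]],
                 cache_dir: str, cache_size: int = 1000):
        self.compute = compute
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.cache_size = cache_size
        self.memory: "OrderedDict[str, List[float]]" = OrderedDict()
        self.stats = {"memory_hits": 0, "disk_hits": 0, "misses": 0}

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as most recently used
            self.stats["memory_hits"] += 1
            return self.memory[key]

        path = self.cache_dir / f"{key}.json"
        if path.exists():
            vec = json.loads(path.read_text())
            self.stats["disk_hits"] += 1
        else:
            vec = self.compute(text)
            path.write_text(json.dumps(vec))
            self.stats["misses"] += 1

        self.memory[key] = vec
        if len(self.memory) > self.cache_size:
            self.memory.popitem(last=False)  # evict least recently used
        return vec


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding call."""
    return [float(len(text)), 1.0]


with tempfile.TemporaryDirectory() as tmp:
    cache = TwoLayerCache(fake_embed, tmp, cache_size=1)
    cache.get("alpha")  # miss: computed and written to disk
    cache.get("alpha")  # memory hit
    cache.get("beta")   # miss: evicts "alpha" from memory
    cache.get("alpha")  # disk hit
    final_stats = dict(cache.stats)

print(final_stats)  # {'memory_hits': 1, 'disk_hits': 1, 'misses': 2}
```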

## Integration with LLM Providers

### Enhanced Context Selection

```python
from abstractcore import create_llm
from abstractcore.embeddings import EmbeddingManager

def smart_context_selection(query, documents, max_context_length=2000):
    """Select the most relevant context that fits within a character budget.

    Note: max_context_length counts characters (len()), not tokens.
    """
    embedder = EmbeddingManager()

    # Score all documents
    scored_docs = []
    for doc in documents:
        similarity = embedder.compute_similarity(query, doc)
        scored_docs.append((similarity, doc))

    # Sort by relevance
    scored_docs.sort(reverse=True)

    # Select documents that fit within context limit
    selected_context = ""
    for similarity, doc in scored_docs:
        test_context = selected_context + "\n" + doc
        if len(test_context) <= max_context_length:
            selected_context = test_context
        else:
            break

    return selected_context.strip()

# Usage with LLM
llm = create_llm("anthropic", model="claude-haiku-4-5")

documents = [
    "Long document about machine learning...",
    "Another document about data science...",
    # ... many more documents
]

query = "What is supervised learning?"
context = smart_context_selection(query, documents)

response = llm.generate(f"Context: {context}\n\nQuestion: {query}")
print(response.content)
```

### Multi-language Support

```python
# EmbeddingGemma supports 100+ languages
embedder = EmbeddingManager(model="embeddinggemma")

# Cross-language similarity
similarity = embedder.compute_similarity(
    "Hello world",      # English
    "Bonjour le monde"  # French
)
print(f"Cross-language similarity: {similarity:.3f}")

# Multilingual semantic search
documents_multilingual = [
    "Machine learning is transforming technology",  # English
    "L'intelligence artificielle change le monde",  # French
    "人工智能正在改变世界",                        # Chinese
    "Künstliche Intelligenz verändert die Welt"    # German
]

query = "artificial intelligence"
for doc in documents_multilingual:
    similarity = embedder.compute_similarity(query, doc)
    print(f"{similarity:.3f}: {doc}")
```

## Production Considerations

### Error Handling

```python
from abstractcore.embeddings import EmbeddingManager

def safe_embedding(text, embedder, fallback_value=None):
    """Generate an embedding; return fallback_value (default: None) on failure."""
    try:
        return embedder.embed(text)
    except Exception as e:
        print(f"Embedding failed for text: {text[:50]}...")
        print(f"Error: {e}")
        # If you pass a zero-vector fallback, match your model's dimension
        # (384 for the default all-MiniLM-L6-v2, 768 for embeddinggemma).
        return fallback_value

embedder = EmbeddingManager()

# Safe embedding generation
text = "Some text that might cause issues"
embedding = safe_embedding(text, embedder)

if embedding is not None:
    print(f"Successfully generated embedding: {len(embedding)} dimensions")
else:
    print("Embedding failed and no fallback was provided")

### Monitoring and Metrics

```python
import time
from abstractcore.embeddings import EmbeddingManager

class MonitoredEmbeddingManager:
    def __init__(self, *args, **kwargs):
        self.embedder = EmbeddingManager(*args, **kwargs)
        self.stats = {
            'total_calls': 0,
            'total_time': 0.0
        }

    def embed(self, text):
        start_time = time.time()
        result = self.embedder.embed(text)
        duration = time.time() - start_time

        self.stats['total_calls'] += 1
        self.stats['total_time'] += duration

        return result

    def get_stats(self):
        avg_time = self.stats['total_time'] / max(self.stats['total_calls'], 1)
        return {
            **self.stats,
            'average_time': avg_time,
            'calls_per_second': 1 / avg_time if avg_time > 0 else 0
        }

# Usage
monitored_embedder = MonitoredEmbeddingManager()

# Generate some embeddings
for i in range(10):
    monitored_embedder.embed(f"Test text number {i}")

# Check performance
stats = monitored_embedder.get_stats()
print(f"Total calls: {stats['total_calls']}")
print(f"Average time per call: {stats['average_time']:.3f}s")
print(f"Calls per second: {stats['calls_per_second']:.1f}")
```

## When to Use Embeddings

### Good Use Cases

- **Semantic Search**: Find relevant documents based on meaning, not keywords
- **RAG Applications**: Select relevant context for language model queries
- **Content Recommendation**: Find similar articles, products, or content
- **Clustering**: Group similar documents or texts together
- **Duplicate Detection**: Find near-duplicate content
- **Multi-language Search**: Search across different languages

### Not Ideal For

- **Exact Matching**: Use traditional text search for exact matches
- **Structured Data**: Use SQL databases for structured queries
- **Real-time Critical Applications**: Embedding computation has latency
- **Very Short Texts**: Embeddings work better with meaningful content
- **High-frequency Operations**: Consider caching for repeated queries

## Using Embeddings via REST API

If you prefer HTTP endpoints over Python code, use the AbstractCore server:

```bash
# Start the server
pip install "abstractcore[server]"
abstractcore serve
```

**HTTP Request:**
```bash
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Machine learning is fascinating",
    "model": "huggingface/sentence-transformers/all-MiniLM-L6-v2"
  }'
```

**Model IDs via REST API (examples):**
- `huggingface/model-name`
- `ollama/model-name`
- `lmstudio/model-name`

**Complete REST API documentation:** [Server API Reference](server.md#embeddings)
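
The same request can be built from Python with only the standard library. The sketch below constructs the request object and leaves sending commented out, since that requires a running server. `build_embeddings_request` is a hypothetical helper, and the response shape is not shown because it is not documented in this section.

```python
import json
import urllib.request


def build_embeddings_request(
    text: str,
    model: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST request for the /v1/embeddings endpoint."""
    body = json.dumps({"input": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embeddings_request(
    "Machine learning is fascinating",
    "huggingface/sentence-transformers/all-MiniLM-L6-v2",
)
print(req.get_method(), req.full_url)  # POST http://localhost:8000/v1/embeddings

# To send it (requires `abstractcore serve` to be running):
# with urllib.request.urlopen(req) as resp:
#     payload = json.load(resp)
```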

## Provider-Specific Features

### HuggingFace Features
- **ONNX Acceleration** (when available)
- **Matryoshka Truncation**: Reduce dimensions for efficiency
- **Persistent Caching**: Automatic disk caching of embeddings

### Ollama Features
- **Simple Setup**: Just `ollama pull <model>`
- **Full Privacy**: No data leaves your machine
- **Custom Models**: Use any Ollama-compatible model

### LMStudio Features
- **GUI Management**: Easy model loading via GUI
- **Testing Friendly**: Suitable for experimentation
- **OpenAI Compatible**: Standard API format

## Next Steps

- **Start Simple**: Try the semantic search example with your own data
- **Experiment with Providers**: Compare HuggingFace, Ollama, and LMStudio
- **Optimize Performance**: Use batch processing and caching for production
- **Build RAG**: Combine embeddings with AbstractCore LLMs for RAG applications
- **Use REST API**: Deploy embeddings as HTTP service with the server

## Related Documentation

**Core Library:**
- **[Python API Reference](api-reference.md)** - Complete EmbeddingManager API
- **[Getting Started](getting-started.md)** - Basic AbstractCore setup
- **[Examples](examples.md)** - More practical examples

**Server (REST API):**
- **[Server Guide](server.md)** - Server setup and deployment
- **[Server API Reference](server.md#embeddings)** - REST API endpoints including embeddings
- **[Troubleshooting](troubleshooting.md)** - Common embedding issues

---

**Remember**: Embeddings are the foundation for semantic understanding. Combined with AbstractCore's multi-provider LLM capabilities, you can build sophisticated AI applications that understand meaning, not just keywords.

---

### Inlined: `docs/centralized-config.md`

# AbstractCore Centralized Configuration

AbstractCore provides a unified configuration system that manages default models, cache directories, logging settings, and other package-wide preferences from a single location.

## Quick Setup

```bash
# Interactive guided setup (8 steps: model, vision, API keys, server auth, audio, video, embeddings, logging)
abstractcore --config

# Check readiness of all subsystems + download/install what's missing
abstractcore --install

# Auto-download everything missing (non-interactive)
abstractcore --install --yes

# View current configuration
abstractcore --status
```

## Configuration File Location

Configuration is stored in: `~/.abstractcore/config/abstractcore.json`

API keys saved via `--config` or `--set-api-key` are persisted here and automatically injected into the process environment (e.g. `OPENAI_API_KEY`) at startup. Environment variables always take precedence over config-persisted keys.

HTTP server settings saved via `--config` or the `--set-server-*` commands are also injected into the corresponding `ABSTRACTCORE_SERVER_*`, `HOST`, and `PORT` environment variables when those variables are not already set.
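
The precedence rule amounts to "inject only what is not already set". The sketch below illustrates it with a plain dictionary standing in for the process environment. `inject_persisted_settings` is a hypothetical helper, and the mapping from config entries to variable names is assumed, not taken from AbstractCore's code.

```python
from typing import Dict, List, MutableMapping


def inject_persisted_settings(
    env: MutableMapping[str, str],
    persisted: Dict[str, str],
) -> List[str]:
    """Copy persisted values into env without overwriting existing variables."""
    injected = []
    for name, value in persisted.items():
        if name not in env:  # an existing environment variable always wins
            env[name] = value
            injected.append(name)
    return injected


# Stand-in for os.environ, with one key already exported by the shell
env = {"OPENAI_API_KEY": "sk-from-shell"}
persisted = {"OPENAI_API_KEY": "sk-from-config", "PORT": "8000"}

injected = inject_persisted_settings(env, persisted)
print(injected)               # ['PORT']
print(env["OPENAI_API_KEY"])  # sk-from-shell
```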

## Configuration Sections

### Application Defaults

Set default providers and models for specific AbstractCore applications:

```bash
# Set defaults for individual apps
abstractcore --set-app-default summarizer openai gpt-4o-mini
abstractcore --set-app-default cli anthropic claude-haiku-4-5
abstractcore --set-app-default extractor ollama qwen3:4b-instruct
abstractcore --set-app-default intent lmstudio qwen/qwen3-4b-2507

# View current app defaults
abstractcore --status
```

### Global Defaults

Set fallback defaults when app-specific configurations are not available:

```bash
# Set global fallback model
abstractcore --set-global-default lmstudio:qwen/qwen3.6-35b-a3b

# Set specialized defaults
abstractcore --set-chat-model openai/gpt-4o-mini
abstractcore --set-code-model anthropic/claude-haiku-4-5
```

`--set-global-default` writes the canonical `input.text` capability route
default. `output.text` is a read-only derived view of `input.text`, so text
understanding and text generation stay on the same LLM route. Commands that
target `output.text` are accepted for compatibility, but persist to
`input.text`.

### Capability Routing Defaults

Capability route defaults use `kind.modality` keys and store a small provider
target: `provider`, `model`, optional `base_url`, and provider/plugin `options`.

Route kinds:

- `input`: understanding/enrichment of request content
- `output`: generated content
- `embedding`: vectors for retrieval/indexes
- `rerank`: reserved for the future reranker manager

Examples:

```bash
abstractcore config set-default input.text \
  --provider lmstudio \
  --model qwen/qwen3.6-35b-a3b \
  --base-url http://127.0.0.1:1234/v1

abstractcore config set-default output.voice \
  --provider supertonic \
  --model supertonic-3 \
  --option voice=M1

abstractcore config defaults

abstractcore config clear-default output.voice
```

Route defaults are configuration only; they do not load a model into a provider.
Provider residency is reported separately.
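
As a mental model, a route default can be pictured as a small record keyed by `kind.modality`. The sketch below is illustrative only: `RouteTarget`, `parse_route_key`, and `storage_key` are hypothetical names, and the real on-disk schema may differ. It also shows `output.text` persisting to `input.text`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

ROUTE_KINDS = {"input", "output", "embedding", "rerank"}


@dataclass
class RouteTarget:
    provider: str
    model: str
    base_url: Optional[str] = None
    options: Dict[str, str] = field(default_factory=dict)


def parse_route_key(key: str) -> Tuple[str, str]:
    """Split 'kind.modality' and validate the kind."""
    kind, sep, modality = key.partition(".")
    if not sep or kind not in ROUTE_KINDS or not modality:
        raise ValueError(f"invalid route key: {key!r}")
    return kind, modality


def storage_key(key: str) -> str:
    """output.text is a derived view, so it is stored under input.text."""
    parse_route_key(key)
    return "input.text" if key == "output.text" else key


defaults: Dict[str, RouteTarget] = {}
defaults[storage_key("output.text")] = RouteTarget(
    "lmstudio", "qwen/qwen3.6-35b-a3b", base_url="http://127.0.0.1:1234/v1"
)
defaults[storage_key("output.voice")] = RouteTarget(
    "supertonic", "supertonic-3", options={"voice": "M1"}
)

print(sorted(defaults))  # ['input.text', 'output.voice']
```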

`input.image` is a fallback route for image understanding when the configured
text route cannot natively accept images. If the `input.text` model is known in
AbstractCore's model-capability registry as image-capable, the effective
`input.image` route is marked as covered by `input.text` and should not be
edited separately.

`input.video` is the corresponding video-understanding route. If the
`input.text` model can process visual frames, Core may report `input.video` as
covered by `input.text`; unlike `input.image`, that coverage is overrideable so
you can route video through a dedicated VLM or video-capable endpoint.
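
The difference between the two coverage rules can be summarized in a small decision function. This is a sketch under stated assumptions: it treats "image-capable" and "can process visual frames" as a single flag, and `effective_route` plus the target strings are hypothetical rather than Core's actual resolver.

```python
from typing import Optional, Tuple


def effective_route(
    route: str,
    text_model_is_visual: bool,
    explicit_target: Optional[str] = None,
) -> Tuple[Optional[str], bool]:
    """Return (effective source, editable) for 'input.image' or 'input.video'."""
    if route == "input.image":
        if text_model_is_visual:
            # Covered by input.text and not meant to be edited separately
            return "input.text", False
        return explicit_target, True
    if route == "input.video":
        if explicit_target is not None:
            # Coverage is overrideable: an explicit route wins
            return explicit_target, True
        if text_model_is_visual:
            return "input.text", True
        return None, True
    raise ValueError(f"unsupported route: {route}")


print(effective_route("input.image", True, "some-vlm"))  # ('input.text', False)
print(effective_route("input.video", True, "some-vlm"))  # ('some-vlm', True)
print(effective_route("input.video", True))              # ('input.text', True)
```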

`input.voice` is the speech-to-text fallback route. Core does not silently use a
locally installed STT package just because `audio.strategy=auto`; a text-only
model needs `input.voice` configured before audio attachments are transcribed
automatically.

`input.sound` is reserved for non-speech audio understanding: environmental
sound, SFX, audio scenes, and later music-understanding routes. Do not configure
STT-only models there. Source-backed open candidates in the model registry now
include `qwen3-omni-30b-a3b-instruct`,
`qwen3-omni-30b-a3b-captioner`, `qwen2.5-omni-7b`, and
`qwen2-audio-7b-instruct`. Qwen3.6 text/vision/video models remain
`audio_support=false`.

`input.music` is the corresponding music-audio understanding route. If the
configured `input.text` model is known to accept native music/audio input, Core
may report `input.sound` and `input.music` as covered by `input.text`; both rows
remain overrideable so a dedicated audio-understanding backend can be selected.

Use `abstractcore config set-default`, `abstractcore config defaults`, and
`abstractcore config clear-default` for capability route defaults. Capability
route defaults are intentionally managed through the `config` subcommand so
route discovery, provider profiles, and scoped config files share one explicit
interface.

By default Core writes to `~/.abstractcore/config/abstractcore.json`. Operators
and embedded hosts can target a specific Core config with
`ABSTRACTCORE_CONFIG_FILE`, `ABSTRACTCORE_CONFIG_DIR`, or:

```bash
abstractcore config --config-file /srv/runtime/config/abstractcore.json defaults
```

### Cache Directories

Configure cache locations for different components:

```bash
# Set cache directories
abstractcore --set-default-cache-dir ~/.cache/abstractcore
abstractcore --set-huggingface-cache-dir ~/.cache/huggingface
abstractcore --set-local-models-cache-dir ~/.abstractcore/models
```

**Default cache locations:**
- Default cache: `~/.cache/abstractcore`
- HuggingFace cache: `~/.cache/huggingface`
- Local models: `~/.abstractcore/models`

### Logging Configuration

Control logging behavior across all AbstractCore components:

#### Setting Log Levels

```bash
# Change console logging level (what you see in terminal)
abstractcore --set-console-log-level DEBUG    # Show all messages
abstractcore --set-console-log-level INFO     # Show info and above
abstractcore --set-console-log-level WARNING  # Show warnings and errors
abstractcore --set-console-log-level ERROR    # Show only errors (default)
abstractcore --set-console-log-level CRITICAL # Show only critical errors
abstractcore --set-console-log-level NONE     # Disable all console logging

# Change file logging level (when file logging is enabled)
abstractcore --set-file-log-level DEBUG
abstractcore --set-file-log-level INFO
abstractcore --set-file-log-level NONE       # Disable all file logging
```

#### File Logging Controls

```bash
# Enable/disable file logging
abstractcore --enable-file-logging      # Start saving logs to files
abstractcore --disable-file-logging     # Stop saving logs to files

# Set log file location
abstractcore --set-log-base-dir ~/.abstractcore/logs
abstractcore --set-log-base-dir /var/log/abstractcore
```

#### Quick Logging Commands

```bash
# Enable debug mode (sets both console and file to DEBUG)
abstractcore --enable-debug-logging

# Disable console output (keeps file logging if enabled)
abstractcore --disable-console-logging

# Check current logging settings
abstractcore --status  # Shows current levels with change commands
```

**Available log levels:** DEBUG, INFO, WARNING, ERROR, CRITICAL, NONE

**Log level descriptions:**
- **DEBUG**: Show all messages including detailed diagnostics
- **INFO**: Show informational messages and above
- **WARNING**: Show warnings, errors, and critical messages
- **ERROR**: Show only errors and critical messages
- **CRITICAL**: Show only critical errors
- **NONE**: Disable all logging completely

**Default logging settings:**
- Console level: ERROR
- File level: DEBUG
- File logging: Disabled by default
- Log base directory: `~/.abstractcore/logs`
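
If you mirror these levels in your own application code, they map naturally onto Python's standard `logging` levels. The sketch below assumes `NONE` can be modelled as a threshold above `CRITICAL`; AbstractCore may implement it differently. The logger name is hypothetical.

```python
import logging

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
    "NONE": logging.CRITICAL + 1,  # above every standard level: nothing passes
}


def to_logging_level(name: str) -> int:
    """Translate a configured level name into a numeric logging level."""
    try:
        return LEVELS[name.upper()]
    except KeyError:
        raise ValueError(f"unknown log level: {name!r}") from None


logger = logging.getLogger("abstractcore_demo")  # hypothetical logger name

logger.setLevel(to_logging_level("ERROR"))
print(logger.isEnabledFor(logging.WARNING))   # False
print(logger.isEnabledFor(logging.ERROR))     # True

logger.setLevel(to_logging_level("NONE"))
print(logger.isEnabledFor(logging.CRITICAL))  # False
```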

### Vision (image fallback for text-only models)

Configure **vision fallback** (two-stage caption → inject observations) for text-only models:

```bash
# Set vision fallback provider/model
abstractcore --set-vision-provider huggingface Salesforce/blip-image-captioning-base

# Optional: add backups (used if the first vision backend fails)
abstractcore --add-vision-fallback lmstudio qwen/qwen3-vl-4b

# Disable vision fallback
abstractcore --disable-vision
```

Notes:
- `abstractcore --set-vision-caption ...` is deprecated but kept for compatibility.
- Vision fallback is only used for **image/video inputs** when the *main* model is text-only.

### Audio (default policy + explicit speech-to-text fallback)

Audio attachments are controlled by `audio_policy`. The default is `auto`: use native audio when supported, otherwise use the configured `input.voice` capability route as the speech-to-text fallback. If `input.voice` is not configured, Core reports a configuration error for text-only models instead of silently choosing an installed STT backend.

```bash
# STT fallback requires abstractvoice
pip install abstractvoice

# Select the STT route used by audio_policy=auto
abstractcore config set-default input.voice \
  --provider faster-whisper \
  --model large-v3

# Override strategy explicitly (auto is the default)
abstractcore --set-audio-strategy auto

# Optional: set a language hint (e.g. en, fr)
abstractcore --set-stt-language fr
```

Notes:
- `audio_policy="auto"` uses native audio when supported, otherwise the configured `input.voice` route.
- `audio_policy="speech_to_text"` forces STT and injects a transcript into the request; direct per-call routing may still provide explicit STT parameters.
- `audio_policy="native_only"` errors on text-only models (no fallback).
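
The three policies above can be read as a decision table. The sketch below encodes that table in plain Python. `resolve_audio_handling` and `ConfigurationError` are hypothetical names, and the `speech_to_text` branch is simplified: it ignores per-call STT parameters and requires a configured `input.voice` route.

```python
from typing import Optional


class ConfigurationError(Exception):
    """Raised when the policy cannot be satisfied (hypothetical name)."""


def resolve_audio_handling(
    audio_policy: str,
    model_supports_audio: bool,
    input_voice_route: Optional[str],
) -> str:
    """Return 'native' or 'speech_to_text:<route>' for an audio attachment."""
    if audio_policy == "native_only":
        if not model_supports_audio:
            raise ConfigurationError("native_only: model is text-only")
        return "native"

    if audio_policy == "auto" and model_supports_audio:
        return "native"

    if audio_policy in ("auto", "speech_to_text"):
        if input_voice_route is None:
            # No silent fallback to whatever STT backend happens to be installed
            raise ConfigurationError("configure the input.voice route first")
        return f"speech_to_text:{input_voice_route}"

    raise ValueError(f"unknown audio_policy: {audio_policy!r}")


print(resolve_audio_handling("auto", True, None))
# native
print(resolve_audio_handling("auto", False, "faster-whisper/large-v3"))
# speech_to_text:faster-whisper/large-v3
```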

### Video (native vs configured frame/video fallback)

Video attachments are controlled by `video_policy`. By default (`auto`), AbstractCore uses native video input when supported. Otherwise it can sample frames via `ffmpeg` and route those frames through the selected `input.text` model when that model supports visual input, or through an explicit `input.video` route when configured.

```bash
abstractcore --set-video-strategy auto
abstractcore --set-video-max-frames 6
abstractcore --set-video-sampling-strategy keyframes
abstractcore --set-video-max-frame-side 1024

abstractcore config set-default input.video \
  --provider endpoint:office-vlm \
  --model qwen2.5-vl-72b
```

Notes:
- Frame sampling requires `ffmpeg`/`ffprobe` available on `PATH`.
- If the main model is text-only and `input.video` is not configured, Core reports a configuration error rather than silently using an unrelated vision/video backend.

### API Keys

Manage API keys for different providers:

```bash
# Set API keys
abstractcore --set-api-key openai sk-your-key-here
abstractcore --set-api-key anthropic your-anthropic-key
abstractcore --set-api-key openrouter your-openrouter-key
abstractcore --set-api-key portkey your-portkey-key
abstractcore --set-api-key openai-compatible your-endpoint-key
abstractcore --set-api-key vllm your-vllm-key

# List API key status
abstractcore --list-api-keys
```

Supported provider key names include `openai`, `anthropic`, `openrouter`, `portkey`, `openai-compatible`, `vllm`, and `google`.

### HTTP Server Gateway Auth

The optional OpenAI-compatible HTTP server has its own inbound auth token. This is separate from upstream provider keys.

```bash
# Set the AbstractCore server auth token
abstractcore --set-server-auth-token acore-server-secret

# Start the server; clients now authenticate with:
# Authorization: Bearer acore-server-secret
abstractcore serve
```

When a server auth token is configured, authenticated clients can use provider keys configured on the server through `abstractcore --set-api-key ...` or environment variables. To override one upstream provider for one request, clients can send `X-AbstractCore-Provider-API-Key`.

When no server auth token is configured, clients may bring their own upstream provider key with `Authorization: Bearer <provider-key>` or `X-AbstractCore-Provider-API-Key`, but they cannot use server-held provider keys. Provider keys in request bodies or query strings are disabled.
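
From a client's point of view, the two modes differ only in which credential goes into which header. The sketch below builds the headers for each case. `build_auth_headers` is a hypothetical client-side helper and the key values are placeholders.

```python
from typing import Dict, Optional


def build_auth_headers(
    server_token: Optional[str] = None,
    provider_key: Optional[str] = None,
) -> Dict[str, str]:
    """Choose request headers for an AbstractCore server call."""
    headers: Dict[str, str] = {}
    if server_token:
        # Authenticated mode: the bearer token is the server's own auth token
        headers["Authorization"] = f"Bearer {server_token}"
        if provider_key:
            # Optional per-request override of the upstream provider key
            headers["X-AbstractCore-Provider-API-Key"] = provider_key
    elif provider_key:
        # No server token: the client brings its own upstream provider key
        headers["Authorization"] = f"Bearer {provider_key}"
    return headers


print(build_auth_headers(server_token="acore-server-secret"))
# {'Authorization': 'Bearer acore-server-secret'}
print(build_auth_headers(provider_key="sk-your-key-here"))
# {'Authorization': 'Bearer sk-your-key-here'}
```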

Server hardening commands:

```bash
# Local/dev only escape hatch
abstractcore --allow-unauthenticated-server
abstractcore --disallow-unauthenticated-server

# Allow non-loopback per-request base_url overrides
abstractcore --set-server-base-url-allowlist "https://example.com/v1,api.internal.local"

# Allow URL media fetches for otherwise blocked targets
abstractcore --set-server-url-fetch-allowlist "https://files.example.com"

# Safe local file path support for HTTP media attachments
abstractcore --set-server-media-root /srv/abstractcore-media

# Unsafe unrestricted local file support
abstractcore --allow-server-local-files
abstractcore --disallow-server-local-files

# Defaults for `abstractcore serve`
abstractcore --set-server-host 127.0.0.1
abstractcore --set-server-port 8000
```

The interactive wizard (`abstractcore --config`) asks for the same persisted server security surface: server auth token, unauthenticated local/dev mode, request `base_url` allowlist, URL media-fetch allowlist, safe media root, unrestricted local-file toggle, and default server host/port. It can generate an auth token and accepts `clear` for the auth token, allowlists, and media root.

### Streaming Configuration

Configure default streaming behavior for CLI:

```bash
# Set streaming behavior
abstractcore --stream on           # Enable streaming by default
abstractcore --stream off          # Disable streaming by default

# Alternative commands
abstractcore --enable-streaming    # Enable streaming by default
abstractcore --disable-streaming   # Disable streaming by default
```

**Note**: Streaming only affects CLI behavior. Apps (summarizer, extractor, judge, intent) don't support streaming because they need complete structured outputs.

## Priority System

AbstractCore uses a clear priority hierarchy for configuration:

1. **Explicit Parameters** (highest priority)
   ```bash
   summarizer document.txt --provider openai --model gpt-4o-mini
   ```

2. **App-Specific Configuration**
   ```bash
   abstractcore --set-app-default summarizer openai gpt-4o-mini
   ```

3. **Global Configuration**
   ```bash
   abstractcore --set-global-default openai/gpt-4o-mini
   ```

4. **Hardcoded Defaults** (lowest priority)
   - Used when no configuration is available
   - Current default: `huggingface/unsloth/Qwen3-4B-Instruct-2507-GGUF`
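
The lookup order above can be sketched in a few lines of Python. This is a minimal model, not AbstractCore's resolver: the function name `resolve_model` is hypothetical, the key names (`app_defaults`, `default_models`, `<app>_provider`, `global_provider`, and so on) follow the configuration file format documented in this guide, and requiring both provider and model at each level is an assumption.

```python
from typing import Mapping, Optional, Tuple

# Hardcoded default quoted from the priority list above.
HARDCODED_DEFAULT = ("huggingface", "unsloth/Qwen3-4B-Instruct-2507-GGUF")


def resolve_model(
    app: str,
    config: Mapping[str, Mapping[str, Optional[str]]],
    cli_provider: Optional[str] = None,
    cli_model: Optional[str] = None,
) -> Tuple[str, str, str]:
    """Return (provider, model, source) using the four-level priority."""
    # 1. Explicit parameters (e.g. --provider / --model on the command line).
    if cli_provider and cli_model:
        return cli_provider, cli_model, "explicit"

    # 2. App-specific configuration.
    app_defaults = config.get("app_defaults", {})
    provider = app_defaults.get(f"{app}_provider")
    model = app_defaults.get(f"{app}_model")
    if provider and model:
        return provider, model, "app"

    # 3. Global configuration.
    global_defaults = config.get("default_models", {})
    provider = global_defaults.get("global_provider")
    model = global_defaults.get("global_model")
    if provider and model:
        return provider, model, "global"

    # 4. Hardcoded defaults.
    return (*HARDCODED_DEFAULT, "hardcoded")
```

With only a global default configured, `resolve_model("judge", config)` falls through to level 3, while passing `cli_provider` and `cli_model` short-circuits every configured value.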

## Debug Mode

The `--debug` parameter overrides configured logging levels and shows detailed diagnostics:

```bash
# Enable debug mode in apps
summarizer document.txt --debug
extractor data.txt --debug

# Debug output shows:
# 🐛 Debug - Configuration details:
#    Provider: huggingface
#    Model: unsloth/Qwen3-4B-Instruct-2507-GGUF
#    Config source: configured defaults
#    Max tokens: 32000
#    ...
```

## Configuration Status

View complete configuration status:

```bash
abstractcore --status
```

This displays:
- Application defaults for each app
- Global fallback settings
- Vision configuration
- Embeddings settings
- API key status
- HTTP server gateway auth/hardening status
- Cache directories
- Logging configuration
- Configuration file location

## Interactive Configuration

Set up configuration interactively:

```bash
abstractcore --config
```

This guides you through:
- Default model selection
- Vision fallback setup
- API key configuration
- HTTP server gateway auth/hardening
- Audio/video fallback policies
- Embeddings setup
- Console logging verbosity

## Example Workflows

### Initial Setup

```bash
# 1. Check current status
abstractcore --status

# 2. Set global fallback
abstractcore --set-global-default ollama/llama3:8b

# 3. Configure specific apps for optimal performance
abstractcore --set-app-default summarizer openai gpt-4o-mini
abstractcore --set-app-default extractor ollama qwen3:4b-instruct
abstractcore --set-app-default judge anthropic claude-haiku-4-5

# 4. Set API keys as needed
abstractcore --set-api-key openai sk-your-key-here
abstractcore --set-api-key anthropic your-anthropic-key
# Optional (only for the providers you plan to use):
abstractcore --set-api-key openrouter your-openrouter-key
abstractcore --set-api-key openai-compatible your-endpoint-key
abstractcore --set-api-key vllm your-vllm-key

# 5. Optional: configure the OpenAI-compatible HTTP server gateway
abstractcore --set-server-auth-token acore-server-secret
abstractcore --set-server-base-url-allowlist "https://example.com/v1"

# 6. Configure logging for development
abstractcore --enable-debug-logging
abstractcore --enable-file-logging

# 7. Enable streaming for interactive CLI
abstractcore --stream on

# 8. Verify configuration
abstractcore --status
```

### Development Environment

```bash
# Enable verbose logging for development
abstractcore --set-console-log-level DEBUG
abstractcore --enable-file-logging
abstractcore --set-log-base-dir ./logs

# Use local models to avoid API costs
abstractcore --set-global-default ollama/llama3:8b
abstractcore --set-app-default summarizer ollama qwen3:4b-instruct
```

### Production Environment

```bash
# Use production API services
abstractcore --set-global-default openai/gpt-4o-mini
abstractcore --set-api-key openai "$OPENAI_API_KEY"

# Set production logging
abstractcore --set-console-log-level WARNING
abstractcore --set-file-log-level INFO
abstractcore --enable-file-logging
abstractcore --set-log-base-dir /var/log/abstractcore
```

## Configuration File Format

The configuration is stored as JSON in `~/.abstractcore/config/abstractcore.json`:

```json
{
  "vision": {
    "strategy": "two_stage",
    "caption_provider": "huggingface",
    "caption_model": "Salesforce/blip-image-captioning-base",
    "fallback_chain": [
      {
        "provider": "huggingface",
        "model": "Salesforce/blip-image-captioning-base"
      }
    ],
    "local_models_path": "~/.abstractcore/models/"
  },
  "audio": {
    "strategy": "native_only",
    "stt_backend_id": null,
    "stt_language": null,
    "caption_provider": null,
    "caption_model": null,
    "fallback_chain": []
  },
  "video": {
    "strategy": "auto",
    "max_frames": 3,
    "max_frames_native": 8,
    "frame_format": "jpg",
    "sampling_strategy": "uniform",
    "max_frame_side": 1024,
    "max_video_size_bytes": null
  },
  "embeddings": {
    "provider": "huggingface",
    "model": "all-minilm-l6-v2"
  },
  "app_defaults": {
    "cli_provider": "huggingface",
    "cli_model": "unsloth/Qwen3-4B-Instruct-2507-GGUF",
    "summarizer_provider": "openai",
    "summarizer_model": "gpt-4o-mini",
    "extractor_provider": "ollama",
    "extractor_model": "qwen3:4b-instruct",
    "judge_provider": "anthropic",
    "judge_model": "claude-haiku-4-5",
    "intent_provider": "lmstudio",
    "intent_model": "qwen/qwen3-4b-2507"
  },
  "default_models": {
    "global_provider": "ollama",
    "global_model": "llama3:8b",
    "chat_model": null,
    "code_model": null
  },
  "api_keys": {
    "openai": null,
    "anthropic": null,
    "openrouter": null,
    "portkey": null,
    "openai_compatible": null,
    "vllm": null,
    "google": null
  },
  "server": {
    "api_key": null,
    "allow_unauthenticated": false,
    "base_url_allowlist": null,
    "url_fetch_allowlist": null,
    "media_root": null,
    "allow_local_files": false,
    "host": null,
    "port": null
  },
  "cache": {
    "default_cache_dir": "~/.cache/abstractcore",
    "huggingface_cache_dir": "~/.cache/huggingface",
    "local_models_cache_dir": "~/.abstractcore/models",
    "glyph_cache_dir": "~/.abstractcore/glyph_cache"
  },
  "logging": {
    "console_level": "ERROR",
    "file_level": "DEBUG",
    "file_logging_enabled": false,
    "log_base_dir": null,
    "verbatim_enabled": true,
    "console_json": false,
    "file_json": true
  },
  "timeouts": {
    "default_timeout": 7200.0,
    "tool_timeout": 600.0
  },
  "offline": {
    "offline_first": true,
    "allow_network": false,
    "force_local_files_only": true
  },
  "streaming": {
    "cli_stream_default": false
  }
}
```

## Configuration Parameter Reference

### Vision Section
- **strategy**: Vision fallback strategy (`"two_stage"`, `"disabled"`, `"basic_metadata"`)
- **caption_provider**: Provider for vision model (e.g., `"huggingface"`, `"ollama"`)
- **caption_model**: Vision model name (e.g., `"Salesforce/blip-image-captioning-base"`)
- **fallback_chain**: Array of backup vision models to try if primary fails
- **local_models_path**: Directory for local vision model storage

### Audio Section
- **strategy**: Audio input strategy (`"native_only"`, `"speech_to_text"`, `"auto"`)
- **stt_backend_id**: Optional preferred STT backend id (plugin-specific)
- **stt_language**: Optional language hint for STT (e.g. `"en"`, `"fr"`)

### Video Section
- **strategy**: Video input strategy (`"native_only"`, `"frames_caption"`, `"auto"`)
- **max_frames**: Frame budget for frames-based fallback
- **max_frames_native**: Frame budget for native video-capable models
- **sampling_strategy**: `"uniform"` or `"keyframes"`
- **frame_format**: `"jpg"` or `"png"`
- **max_frame_side**: Downscale extracted frames to this max side length (preserves aspect ratio)
- **max_video_size_bytes**: Optional maximum video size allowed for processing (bytes)

### Default Models Section (Global Fallbacks)
- **global_provider** / **global_model**: Default provider/model when app-specific not set (e.g., `"ollama"` / `"llama3:8b"`)
- **chat_model**: Specialized model for chat applications (optional, `provider/model`)
- **code_model**: Specialized model for code generation (optional, `provider/model`)

### App Defaults Section (Per-Application)
- **cli_provider** / **cli_model**: Default for CLI utility
- **summarizer_provider** / **summarizer_model**: Default for document summarization
- **extractor_provider** / **extractor_model**: Default for entity extraction
- **judge_provider** / **judge_model**: Default for text evaluation
- **intent_provider** / **intent_model**: Default for intent analysis

### Embeddings Section
- **provider**: Embeddings provider (`"huggingface"`, `"openai"`, etc.)
- **model**: Embeddings model name (e.g., `"all-minilm-l6-v2"`)

### API Keys Section
- **openai**: OpenAI API key
- **anthropic**: Anthropic API key
- **openrouter**: OpenRouter API key
- **portkey**: Portkey API key
- **openai_compatible**: Generic OpenAI-compatible endpoint API key (set from the CLI with the key name `openai-compatible`)
- **vllm**: vLLM OpenAI-compatible server API key
- **google**: Google API key (reserved for future integrations; not required for current built-in providers)

### Server Section
- **auth_token**: AbstractCore server auth token for inbound client authentication
- **api_key**: Legacy status alias retained for compatibility with older callers
- **allow_unauthenticated**: Local/dev escape hatch for unauthenticated HTTP server requests
- **base_url_allowlist**: Additional non-loopback `base_url` override allowlist
- **url_fetch_allowlist**: URL media fetch allowlist for otherwise blocked targets
- **media_root**: Safe root for local media file paths accepted by the HTTP server
- **allow_local_files**: Unsafe unrestricted local file toggle
- **host** / **port**: Defaults for `abstractcore serve`

### Cache Section
- **default_cache_dir**: General cache directory for AbstractCore (`~/.cache/abstractcore`)
- **huggingface_cache_dir**: HuggingFace models cache (`~/.cache/huggingface`)
- **local_models_cache_dir**: Local models storage (`~/.abstractcore/models`)
- **glyph_cache_dir**: Glyph cache directory (`~/.abstractcore/glyph_cache`)

### Logging Section
- **console_level**: Console log level (`"DEBUG"`, `"INFO"`, `"WARNING"`, `"ERROR"`, `"CRITICAL"`, `"NONE"`)
- **file_level**: File log level (same options as console_level)
- **log_base_dir**: Directory for log files (`null` uses the default, `~/.abstractcore/logs`)
- **file_logging_enabled**: Whether to save logs to files (`true`/`false`)
- **verbatim_enabled**: Whether to capture full prompts/responses (`true`/`false`)
- **console_json**: Use JSON format for console output (`true`/`false`)
- **file_json**: Use JSON format for file output (`true`/`false`)

### Streaming Section
- **cli_stream_default**: Default streaming mode for CLI (`true`/`false`)

### Timeouts Section
- **default_timeout**: Default HTTP timeout for provider calls (seconds)
- **tool_timeout**: Default tool execution timeout (seconds)

### Offline Section
- **offline_first**: Default to offline-first behavior
- **allow_network**: Allow network access when offline-first is enabled (for API providers)
- **force_local_files_only**: Force HuggingFace `local_files_only` mode

## Common Configuration Tasks

### How to Change Console Log Level

If you see "Console Level: DEBUG" in the status and want to change it:

```bash
# To reduce console output (recommended for normal use)
abstractcore --set-console-log-level WARNING

# To see more information during development
abstractcore --set-console-log-level INFO

# To see all debug information
abstractcore --set-console-log-level DEBUG

# To completely disable console logging
abstractcore --set-console-log-level NONE

# Verify the change
abstractcore --status
```

### How to Enable File Logging

To start saving logs to files:

```bash
# Enable file logging (saves to ~/.abstractcore/logs by default)
abstractcore --enable-file-logging

# Optional: change log directory first
abstractcore --set-log-base-dir /path/to/your/logs
abstractcore --enable-file-logging

# Verify file logging is enabled
abstractcore --status
```

### How to Set Up Debug Mode

For troubleshooting, enable debug mode:

```bash
# Enable debug for both console and file logging
abstractcore --enable-debug-logging

# This is equivalent to:
# abstractcore --set-console-log-level DEBUG
# abstractcore --set-file-log-level DEBUG
# abstractcore --enable-file-logging
```

### How to Completely Disable Logging

To turn off all logging output:

```bash
# Disable console logging completely
abstractcore --set-console-log-level NONE

# Disable file logging completely (if enabled)
abstractcore --set-file-log-level NONE
abstractcore --disable-file-logging

# Note: --debug parameter in apps will still override NONE
# This maintains the priority system: explicit parameters > config defaults
```

## Troubleshooting

### Configuration Not Loading

If apps don't use configured defaults:

1. Check configuration file exists:
   ```bash
   ls -la ~/.abstractcore/config/abstractcore.json
   ```

2. Verify configuration content:
   ```bash
   abstractcore --status
   ```

3. Reset configuration if corrupted:
   ```bash
   rm ~/.abstractcore/config/abstractcore.json
   abstractcore --config
   ```
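
Before deleting the file in step 3, it can help to confirm that it is actually corrupted. The sketch below is a standalone check that uses only the standard library and the file location given in this guide; the helper names `parse_config` and `check_config` are hypothetical and are not part of AbstractCore.

```python
import json
from pathlib import Path
from typing import Tuple

# Location documented in this guide.
CONFIG_PATH = Path.home() / ".abstractcore" / "config" / "abstractcore.json"


def parse_config(text: str) -> Tuple[bool, str]:
    """Report whether the given text is a JSON object, with a short message."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not a JSON object"
    return True, f"ok: {len(data)} top-level sections"


def check_config(path: Path = CONFIG_PATH) -> Tuple[bool, str]:
    """Check that the config file exists and parses as a JSON object."""
    if not path.is_file():
        return False, f"missing: {path}"
    return parse_config(path.read_text(encoding="utf-8"))
```

Calling `check_config()` returns a `(bool, message)` pair; a `False` result with an "invalid JSON" message is the case where resetting the file is warranted.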

### Model Initialization Failures

When models fail to initialize, apps show configuration guidance:

```
[ERROR] Failed to initialize LLM 'openai/gpt-4o-mini': API key not configured

[INFO] Solutions:
   - Set API key: abstractcore --set-api-key openai sk-...
   - Use different provider: summarizer document.txt --provider ollama --model llama3:8b

🔧 Or configure a different default:
   - abstractcore --set-app-default summarizer ollama llama3:8b
   - abstractcore --status
```

### Debug Information

Use `--debug` to see detailed configuration information:

```bash
summarizer document.txt --debug
```

This shows:
- Which configuration source is being used
- Exact provider and model values
- All parameter values
- Configuration file location

---

### Inlined: `docs/server.md`

# AbstractCore Server

Transform AbstractCore into an OpenAI-compatible API server. One server, all models, any client.

If you want a dedicated **single-model** `/v1` server (one provider/model per worker), see [Endpoint](endpoint.md).

## Interactive API docs (start here)

Visit while the server is running:
- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`

## Quick Start

### Install and Run (2 minutes)

```bash
# Install
pip install "abstractcore[server]"

# Configure server auth and provider keys
export ABSTRACTCORE_AUTH_TOKEN="acore-server-secret"
export OPENAI_API_KEY="sk-..."

# Start server
abstractcore serve

# Or with uvicorn directly
uvicorn abstractcore.server.app:app --host 0.0.0.0 --port 8000

# Test
curl http://localhost:8000/health
# Response: {"status":"healthy"}
```

### First Request

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ABSTRACTCORE_AUTH_TOKEN" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Or with Python:

```python
import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key=os.environ["ABSTRACTCORE_AUTH_TOKEN"])

response = client.chat.completions.create(
    model="anthropic/claude-haiku-4-5",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)
```

---

## Configuration

### Environment Variables

```bash
# Provider API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENROUTER_API_KEY="sk-or-..."
export PORTKEY_API_KEY="pk_..."         # optional (Portkey)
export PORTKEY_CONFIG="pcfg_..."        # required for Portkey routing

# Server auth token. Authenticated clients can use all server-configured providers.
export ABSTRACTCORE_AUTH_TOKEN="acore-server-secret"

# Local providers
export OLLAMA_BASE_URL="http://localhost:11434"          # (or legacy: OLLAMA_HOST)
export LMSTUDIO_BASE_URL="http://localhost:1234/v1"
export VLLM_BASE_URL="http://localhost:8000/v1"

# Server bind (used by `abstractcore serve` and compatibility module entrypoints)
export HOST="0.0.0.0"
export PORT="8000"

# Debug mode
export ABSTRACTCORE_DEBUG=true

# Dangerous (multi-tenant hazard): allow unload_after for providers that can unload shared server state (e.g. Ollama)
export ABSTRACTCORE_ALLOW_UNSAFE_UNLOAD_AFTER=1

# Server security controls (recommended)
#
# - Request-level base_url overrides are loopback-only by default.
#   URL entries match scheme + exact host + default/explicit port + path-segment prefix.
#   Bare entries match hostname globs, e.g. "*.example.com".
export ABSTRACTCORE_SERVER_BASE_URL_ALLOWLIST="https://api.openai.com,https://example.com/v1"
#
# - Remote URL fetches for attachments are blocked for private/loopback/link-local targets by default (SSRF protection).
#   To allow specific hosts/prefixes, use the same structured allowlist syntax:
export ABSTRACTCORE_SERVER_URL_FETCH_ALLOWLIST="https://www.berkshirehathaway.com"
#
# - Local file paths in HTTP requests are disabled by default (including @/path/to/file in message strings).
#   To allow local file paths safely, restrict them under a single directory:
export ABSTRACTCORE_SERVER_MEDIA_ROOT="/srv/abstractcore-media"
#
# - Unsafe escape hatch: allow arbitrary local file paths from HTTP requests (not recommended)
export ABSTRACTCORE_SERVER_ALLOW_LOCAL_FILES=1
```

### Startup Options

```bash
# Using AbstractCore's built-in CLI
abstractcore serve --help                    # View all options
abstractcore serve --debug                   # Debug mode
abstractcore serve --host 127.0.0.1 --port 8080  # Custom host/port
abstractcore serve --debug --port 8001       # Debug on custom port

# Using uvicorn directly
uvicorn abstractcore.server.app:app --reload                # Development with auto-reload
uvicorn abstractcore.server.app:app --workers 4             # Production with multiple workers
uvicorn abstractcore.server.app:app --port 3000             # Custom port
```

---

## API Endpoints

### Chat Completions

**Endpoint:** `POST /v1/chat/completions`

Standard OpenAI-compatible endpoint. Works with all providers.

Server auth:
- If `ABSTRACTCORE_AUTH_TOKEN` is configured, every non-health endpoint requires
  `Authorization: Bearer $ABSTRACTCORE_AUTH_TOKEN`. Authenticated clients can use all
  provider keys/endpoints configured on the server.
- If `ABSTRACTCORE_AUTH_TOKEN` is not configured, `Authorization: Bearer <provider-key>`
  may be used as a bring-your-own upstream provider key. That key is forwarded only to the
  requested provider and never unlocks server-configured provider keys.
- Health checks (`GET /health`) are always unauthenticated.

**Request:**
```json
{
  "model": "provider/model-name",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 1000,
  "stream": false
}
```

**Key Parameters:**
- `model` (required): Prefer `"provider/model-name"` (e.g., `"openai/gpt-4o-mini"`). If you pass a bare model name (no `/`), the server makes a best-effort attempt to auto-detect the provider.
- `messages` (required): Array of message objects
- `stream` (optional): Enable streaming responses
- `tools` (optional): Tools for function calling
- `agent_format` (optional, AbstractCore extension): Tool-call syntax output format for agentic clients (`"auto"|"openai"|"codex"|"qwen3"|"llama3"|"gemma"|"xml"|"passthrough"`). When omitted, the server auto-detects from user-agent + model heuristics.
- `api_key` (deprecated/disabled, AbstractCore extension): Provider API keys are not accepted in request bodies. Configure provider keys on the server, use `X-AbstractCore-Provider-API-Key` for a per-request provider override, or use `Authorization` as a provider key only when `ABSTRACTCORE_AUTH_TOKEN` is not configured. Select discovery endpoints also accept an `api_key` query parameter for tooling/Swagger UI convenience.
- `base_url` (optional, AbstractCore extension): Override the provider endpoint (include `/v1` for OpenAI-compatible servers like LM Studio / vLLM / OpenRouter)
- `unload_after` (optional, AbstractCore extension): If `true`, calls `llm.unload_model(model)` after the request completes. Disabled for `ollama/*` unless `ABSTRACTCORE_ALLOW_UNSAFE_UNLOAD_AFTER=1`.
- `prompt_cache_key` (optional, AbstractCore extension): Best-effort prompt caching key (semantics depend on provider/backend). See `docs/prompt-caching.md`.
- `prompt_cache_binding` (optional, AbstractCore extension): Exact durable bloc binding returned by `/acore/blocs/kv/load`. When supplied, the server verifies the cache key before generation or streaming; stale/missing bindings return `409`.
- `prompt_cache_retention` (optional, AbstractCore extension): Prompt cache retention policy (OpenAI: `"in_memory"` or `"24h"`; ignored by other providers). See `docs/prompt-caching.md`.
- `thinking` (optional, AbstractCore extension): Unified thinking/reasoning control (`null|"auto"|"on"|"off"|"none"` or `"low"|"medium"|"high"|"xhigh"` when supported). Note: `"none"` is treated as an alias for `"off"`.
- `temperature`, `max_tokens`, `top_p`: Standard LLM parameters

#### Thinking (AbstractCore extension)

The server forwards `thinking` to the underlying provider using AbstractCore’s unified thinking mapping (see [Generation Parameters](generation-parameters.md)).

Example (route to LM Studio + Qwen3.5, disable thinking):

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmstudio/qwen3.5-27b@q4_k_m",
    "base_url": "http://localhost:1234/v1",
    "messages": [{"role": "user", "content": "Compute 17*23 - 19*11. Reply with the integer only."}],
    "thinking": "none",
    "max_tokens": 64
  }'
```

Notes:
- For **Qwen3 / Qwen3.5 on LM Studio**, `thinking="none"` maps to LM Studio’s template variables (`enable_thinking` / `enableThinking`) plus a Qwen template “hard switch” fallback (empty `<think></think>`) when needed. This avoids injecting “reasoning effort” instructions into the system prompt.
- Not every backend supports per-effort budgets for `low|medium|high`; when unavailable, levels degrade to “thinking enabled”.
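
The same request can be assembled from Python. The sketch below uses only the standard library and validates `thinking` against the values listed under Key Parameters before building the request. The helper name `build_chat_request` is hypothetical, and the client-side validation is a convenience added here, not something the server requires.

```python
import json
import urllib.request
from typing import Optional

# Values listed for the `thinking` parameter in this guide.
THINKING_VALUES = {"auto", "on", "off", "none", "low", "medium", "high", "xhigh"}


def build_chat_request(
    base_url: str,
    model: str,
    prompt: str,
    thinking: Optional[str] = None,
    auth_token: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to <base_url>/chat/completions."""
    if thinking is not None and thinking not in THINKING_VALUES:
        raise ValueError(f"unsupported thinking value: {thinking!r}")

    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if thinking is not None:
        # Omitting the field leaves the choice to the server (same as null).
        payload["thinking"] = thinking

    headers = {"Content-Type": "application/json"}
    if auth_token:
        headers["Authorization"] = f"Bearer {auth_token}"

    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# To send it against a running server:
#   req = build_chat_request("http://localhost:8000/v1",
#                            "lmstudio/qwen3.5-27b@q4_k_m",
#                            "Compute 17*23 - 19*11.", thinking="off")
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Clients built on the OpenAI Python SDK can pass the same field through the `extra_body` argument of `client.chat.completions.create(...)`.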

**Example with streaming:**

```python
import os
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key=os.environ["ABSTRACTCORE_AUTH_TOKEN"])

stream = client.chat.completions.create(
    model="ollama/qwen3-coder:30b",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

#### Provider `base_url` override (AbstractCore extension)

Route a provider to a specific endpoint (useful for remote OpenAI-compatible servers).

Security notes:
- Request-level `base_url` overrides are **loopback-only by default**. To allow additional
  origins or host globs, set `ABSTRACTCORE_SERVER_BASE_URL_ALLOWLIST`. URL entries are parsed
  and matched on scheme, exact host, effective port, and path-segment prefix.
- If the server has an environment provider key set (e.g. `OPENAI_API_KEY`) and you route to a **non-loopback** `base_url`, the request is refused unless the provider key was supplied explicitly with `X-AbstractCore-Provider-API-Key`, or with `Authorization` when server auth is disabled.

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmstudio/qwen/qwen3-4b-2507",
    "base_url": "http://localhost:1234/v1",
    "messages": [{"role": "user", "content": "Hello from a remote LM Studio endpoint"}]
  }'
```
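
The allowlist matching described in the security notes can be sketched as follows. This models the documented rules (URL entries match on scheme, exact host, effective port, and path-segment prefix; bare entries are hostname globs) and is not the server's code. The function names are hypothetical, the default loopback allowance is not modeled, and treating bare entries as independent of scheme and port is an assumption.

```python
from fnmatch import fnmatchcase
from typing import List, Optional
from urllib.parse import SplitResult, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def _effective_port(parts: SplitResult) -> Optional[int]:
    """Explicit port if present, otherwise the scheme's default port."""
    return parts.port or DEFAULT_PORTS.get(parts.scheme)


def _segments(path: str) -> List[str]:
    """Split a URL path into non-empty segments."""
    return [segment for segment in path.split("/") if segment]


def matches_entry(url: str, entry: str) -> bool:
    """Check one candidate URL against one allowlist entry."""
    target = urlsplit(url)
    host = target.hostname or ""

    if "://" not in entry:
        # Bare entry: hostname glob such as "*.example.com".
        return fnmatchcase(host, entry.lower())

    allowed = urlsplit(entry)
    if target.scheme != allowed.scheme:
        return False
    if host != (allowed.hostname or ""):
        return False
    if _effective_port(target) != _effective_port(allowed):
        return False
    # Compare whole path segments, so "/v1" does not match "/v10".
    prefix = _segments(allowed.path)
    return _segments(target.path)[: len(prefix)] == prefix


def is_allowed(url: str, allowlist_csv: str) -> bool:
    """Check a URL against a comma-separated allowlist string."""
    entries = [entry.strip() for entry in allowlist_csv.split(",") if entry.strip()]
    return any(matches_entry(url, entry) for entry in entries)
```

Matching on path segments rather than raw string prefixes is what keeps an entry like `https://example.com/v1` from also admitting `https://example.com/v10/...`.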

#### Provider Authentication

Do not put provider keys in request bodies. Those fields are disabled because they leak through
logs, shell history, browser history, and reverse proxies.

```bash
# Preferred: configure provider keys on the server and authenticate to AbstractCore.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ABSTRACTCORE_AUTH_TOKEN" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

When `ABSTRACTCORE_AUTH_TOKEN` is not configured, `Authorization: Bearer <provider-key>` may
be used as an upstream provider key. Once server auth is enabled, `Authorization` is reserved for
the AbstractCore server auth token and is never forwarded upstream.

To override a single upstream provider while still using the server auth token, send the provider
key in `X-AbstractCore-Provider-API-Key`. The override applies only to the requested provider:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ABSTRACTCORE_AUTH_TOKEN" \
  -H "X-AbstractCore-Provider-API-Key: $ANTHROPIC_API_KEY" \
  -d '{
    "model": "anthropic/claude-haiku-4-5",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Media generation endpoints (optional)

AbstractCore Server can optionally expose OpenAI-compatible **image/video generation** and **audio** endpoints.

Important notes:
- These are **interoperability-first** endpoints (return `b64_json` or raw bytes), not an artifact-first durability contract.
- If the required plugin/backend is not available, the server returns `501` with actionable messaging.

#### Images/video (generate/edit) — requires `abstractvision`

Endpoints:
- `POST /v1/images/generations`
- `POST /v1/images/edits`
- `POST /v1/images/upscale`
- `POST /v1/videos/generations`
- `POST /v1/videos/edits`
- `POST /v1/vision/jobs/images/generations`
- `POST /v1/vision/jobs/images/edits`
- `POST /v1/vision/jobs/images/upscale`
- `POST /v1/vision/jobs/videos/generations`
- `POST /v1/vision/jobs/videos/edits`
- `/v1/vision/*` catalog and model residency surfaces, including video-capable local models such as `mlx-gen/AbstractFramework/wan2.2-t2v-a14b-diffusers-8bit` and `mlx-gen/AbstractFramework/wan2.2-i2v-a14b-diffusers-8bit`

Python `generate(..., output={"task":"text_to_video"})` and
`generate(..., output={"task":"image_to_video"})` use the same Core output
dispatcher. Wan A14B video requests can set typed `guidance_2` beside
`guidance_scale`, while backend-only values such as `max_sequence_length` stay
in `extra`. A top-level `on_progress`, `progress_event_callback`, or
`progress_callback` kwarg is attached to generated image/video output specs and
reaches AbstractVision; async video job polling exposes `progress.last_event`
when the backend reports rich events.

Image upscaling uses the same async job model through
`POST /v1/vision/jobs/images/upscale`. Send multipart fields such as
`provider=mlx-gen`, `model=AbstractFramework/seedvr2-3b-8bit`,
`image=@./input.png`, and `scale=2x`, then poll
`GET /v1/vision/jobs/{job_id}`. MLX-Gen reports denoise-step progress in
`progress.last_event`; for SeedVR2, use `AbstractFramework/seedvr2-3b-8bit`
by default, `AbstractFramework/seedvr2-7b-8bit` when memory allows, or the
matching q4 package when memory is tight.

Python calls for the same path:

```python
def on_progress(event):
    print(event)

direct_png = llm.vision.upscale_image(
    "input.png",
    provider="mlx-gen",
    model="AbstractFramework/seedvr2-3b-8bit",
    scale="2x",
    on_progress=on_progress,
)

resp = llm.generate(
    media={"type": "image", "path": "input.png", "role": "source"},
    on_progress=on_progress,
    output={
        "task": "image_upscale",
        "provider": "mlx-gen",
        "model": "AbstractFramework/seedvr2-3b-8bit",
        "scale": "2x",
    },
)
png = resp.outputs["image"][0].data
```

Install:
```bash
pip install "abstractcore[server]"
pip install abstractvision
```

#### Audio (STT/TTS) — requires an audio/voice capability plugin (typically `abstractvoice`)

Endpoints:
- `POST /v1/audio/transcriptions` (multipart; `file=...`)
- `POST /v1/audio/speech` (json; `input=...`, optional `voice`, optional `format`)

Install:
```bash
pip install "abstractcore[server]"
pip install abstractvoice
```

Notes:
- `/v1/audio/transcriptions` requires `python-multipart` for form parsing (included in the server extra).

Examples:

```bash
# Speech-to-text (STT)
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@speech.wav" \
  -F "language=en"

# Text-to-speech (TTS)
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello!","format":"wav"}' \
  --output hello.wav
```

If you want to “ask a model about an audio file”, prefer one of:
- Run STT first (`/v1/audio/transcriptions`) then send the transcript to `POST /v1/chat/completions`, or
- Configure the server’s default audio strategy (`config.audio.strategy`) to enable STT fallback for audio attachments, then attach audio in chat requests.

### Multimodal Requests (Images, Documents, Files)

AbstractCore server supports file attachments through the OpenAI-compatible multimodal message format, plus AbstractCore's `@filename` syntax.

Security note (HTTP server): local file paths are disabled by default (including `@/path/to/file` and `{"url": "/path/to/file"}`).
Use `http(s)` URLs or `data:` base64, or enable local paths via `ABSTRACTCORE_SERVER_MEDIA_ROOT` (safe) / `ABSTRACTCORE_SERVER_ALLOW_LOCAL_FILES=1` (unsafe).

#### Supported File Types

- **Images**: PNG, JPEG, GIF, WEBP, BMP, TIFF
- **Documents**: PDF, DOCX, XLSX, PPTX
- **Data/Text**: CSV, TSV, TXT, MD, JSON, XML
- **Size Limits**: 10MB per file, 32MB total per request
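
A client can check the size limits before uploading. The sketch below is a hypothetical pre-flight helper, not part of AbstractCore. It reads "MB" as 1,000,000 bytes, which is an assumption chosen because it is the stricter interpretation; the server may count in units of 1,048,576 bytes instead.

```python
from typing import Iterable, List

# Limits from the list above, read conservatively as decimal megabytes.
MAX_FILE_BYTES = 10 * 1000 * 1000
MAX_TOTAL_BYTES = 32 * 1000 * 1000


def check_attachment_sizes(sizes: Iterable[int]) -> List[str]:
    """Return a list of human-readable problems; empty means within limits."""
    sizes = list(sizes)
    problems = []
    for index, size in enumerate(sizes):
        if size > MAX_FILE_BYTES:
            problems.append(f"file {index} is {size} bytes (limit {MAX_FILE_BYTES})")
    total = sum(sizes)
    if total > MAX_TOTAL_BYTES:
        problems.append(f"request total is {total} bytes (limit {MAX_TOTAL_BYTES})")
    return problems
```

Pass the byte sizes of the attachments (for example from `os.path.getsize`) and refuse to send the request when the returned list is non-empty.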

#### Method 1: @filename Syntax (AbstractCore Extension)

Simple syntax that works with all providers (requires local paths enabled via `ABSTRACTCORE_SERVER_MEDIA_ROOT` or `ABSTRACTCORE_SERVER_ALLOW_LOCAL_FILES=1`):

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [
      {"role": "user", "content": "What is in this document? @/path/to/report.pdf"}
    ]
  }'
```

#### Method 2: OpenAI Vision API Format (Image URLs)

Standard OpenAI format for images:

```json
{
  "model": "anthropic/claude-haiku-4-5",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.jpg"
          }
        }
      ]
    }
  ]
}
```

**Base64 Images:**
```json
{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
  }
}
```

#### Method 3: OpenAI File Format (Forward-Compatible)

AbstractCore supports OpenAI's planned file format with simplified structure (consistent with image_url):

**File URL Format (Recommended - Same Pattern as image_url):**
```json
{
  "model": "ollama/qwen3:4b",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Analyze this document"},
        {
          "type": "file",
          "file_url": {
            "url": "https://example.com/documents/report.pdf"
          }
        }
      ]
    }
  ]
}
```

**Local File Path:**
```json
{
  "type": "file",
  "file_url": {
    "url": "/Users/username/documents/data.csv"
  }
}
```

Note: local file paths require `ABSTRACTCORE_SERVER_MEDIA_ROOT` (safe) or `ABSTRACTCORE_SERVER_ALLOW_LOCAL_FILES=1` (unsafe) on the server.

**Base64 Data URL:**
```json
{
  "type": "file",
  "file_url": {
    "url": "data:application/pdf;base64,JVBERi0xLjQKMSAwIG9iago8PAovVHlwZS..."
  }
}
```

**Filename Extraction:**
- **URLs/Paths**: Extracted automatically (`/path/file.pdf` → `file.pdf`)
- **Base64**: Generated from MIME type (`data:application/pdf;base64,...` → `document.pdf`)
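
The two rules above can be sketched client-side as follows. This is an illustration rather than the server's implementation: `guess_filename` is a hypothetical name, and the fixed `document` stem for every MIME type is an assumption generalized from the single PDF example.

```python
import mimetypes
from pathlib import PurePosixPath
from urllib.parse import urlparse


def guess_filename(url: str) -> str:
    """Derive a display filename from a URL, a local path, or a data: URL."""
    if url.startswith("data:"):
        # "data:application/pdf;base64,..." -> "application/pdf"
        mime_type = url[len("data:"):].split(";", 1)[0].split(",", 1)[0]
        # Assumption: a fixed "document" stem plus an extension guessed from the MIME type.
        extension = mimetypes.guess_extension(mime_type) or ""
        return f"document{extension}"
    # For http(s) URLs and plain paths, keep only the last path segment.
    return PurePosixPath(urlparse(url).path).name
```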

#### Mixed Content Example

Combine text, images, and documents in a single request:

```json
{
  "model": "openai/gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Compare this chart with the data in the spreadsheet"},
        {
          "type": "image_url",
          "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANS..."}
        },
        {
          "type": "file",
          "file_url": {
            "url": "https://example.com/data/sales_data.xlsx"
          }
        }
      ]
    }
  ]
}
```

#### Python Client Examples

**Using OpenAI Client:**
```python
import os
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key=os.environ["ABSTRACTCORE_AUTH_TOKEN"])

# Method 1: @filename syntax
response = client.chat.completions.create(
    model="anthropic/claude-haiku-4-5",
    messages=[{"role": "user", "content": "Summarize @document.pdf"}]
)

# Method 2: File URL (HTTP/HTTPS)
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the key findings?"},
            {
                "type": "file",
                "file_url": {
                    "url": "https://example.com/documents/report.pdf"
                }
            }
        ]
    }]
)

# Method 3: Local file path
response = client.chat.completions.create(
    model="anthropic/claude-haiku-4-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this local document"},
            {
                "type": "file",
                "file_url": {
                    "url": "/Users/username/documents/report.pdf"
                }
            }
        ]
    }]
)

# Method 4: Base64 data URL
with open("report.pdf", "rb") as f:
    file_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="lmstudio/qwen/qwen3-next-80b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the key findings?"},
            {
                "type": "file",
                "file_url": {
                    "url": f"data:application/pdf;base64,{file_data}"
                }
            }
        ]
    }]
)
```

**Universal Provider Support:**
```python
# Same syntax works across all providers (reuses `client` from the example above)
providers_models = [
    "openai/gpt-4o",
    "anthropic/claude-haiku-4-5",
    "ollama/qwen2.5vl:7b",
    "lmstudio/qwen/qwen2.5-vl-7b"
]

for model in providers_models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Analyze @data.csv and @chart.png"}]
    )
    print(f"{model}: {response.choices[0].message.content[:100]}...")
```

---

### OpenAI Responses API

**Endpoint:** `POST /v1/responses`

AbstractCore implements an OpenAI-compatible Responses-style API, including `input_file` support.

#### Why Use /v1/responses?

- **OpenAI Compatible**: Accepts OpenAI Responses API requests and returns an OpenAI Responses `object: "response"` payload
- **Native File Support**: `input_file` type designed specifically for document attachments
- **Cleaner API**: Explicit separation between text (`input_text`) and files (`input_file`)
- **Backward Compatible**: Existing `messages` format still works alongside new `input` format
- **Optional Streaming**: `"stream": true` streams OpenAI Responses events (OpenAI format) or chat-completions chunks (legacy format)

#### Request Format

**OpenAI Responses API Format (Recommended):**
```json
{
  "model": "gpt-4o",
  "input": [
    {
      "role": "user",
      "content": [
        {"type": "input_text", "text": "Analyze this document"},
        {"type": "input_file", "file_url": "https://example.com/report.pdf"}
      ]
    }
  ],
  "tools": [
    {"type": "web_search", "external_web_access": true}
  ],
  "tool_choice": "auto",
  "stream": false,
  "max_output_tokens": 2000,
  "temperature": 0.7,
  "thinking": "off",
  "prompt_cache_key": "tenantA:doc-review"
}
```

OpenAI-format `/v1/responses` requests accept the same shared text-inference extensions as `/v1/chat/completions` where they apply, including `stop`, `seed`, `frequency_penalty`, `presence_penalty`, `base_url`, `agent_format`, `thinking`, `prompt_cache_key`, `prompt_cache_retention`, `timeout_s`, and `unload_after`.

**Legacy Format (Still Supported):**
```json
{
  "model": "openai/gpt-4",
  "messages": [
    {"role": "user", "content": "Tell me a story"}
  ],
  "stream": false
}
```

#### Automatic Format Detection

The server automatically detects which format you're using:
- **OpenAI Format**: Presence of `input` field → converts to internal format
- **Legacy Format**: Presence of `messages` field → processes directly
- **Error**: Missing both fields → returns 400 error with clear message
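
The detection rules can be mirrored in a few lines of Python, which is handy for validating request bodies in client tests. This is a sketch under assumptions: `detect_request_format` is a hypothetical helper, and checking `input` before `messages` when both are present is a guess, since the rules above do not cover that case.

```python
from typing import Any, Dict


def detect_request_format(body: Dict[str, Any]) -> str:
    """Classify a /v1/responses request body as "openai" or "legacy"."""
    # Assumption: `input` wins when both fields are present.
    if "input" in body:
        return "openai"  # OpenAI Responses format
    if "messages" in body:
        return "legacy"
    # The server answers with a 400 here; this sketch raises instead.
    raise ValueError("request must include either 'input' or 'messages'")
```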

#### Examples

**Simple Text Request:**
```bash
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmstudio/qwen/qwen3-next-80b",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "What is Python?"}
        ]
      }
    ]
  }'
```

**File Analysis:**
```bash
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "Analyze the letter and summarize key points"},
          {"type": "input_file", "file_url": "https://www.berkshirehathaway.com/letters/2024ltr.pdf"}
        ]
      }
    ]
  }'
```

**Multiple Files:**
```bash
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-haiku-4-5",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "Compare these documents"},
          {"type": "input_file", "file_url": "https://example.com/report1.pdf"},
          {"type": "input_file", "file_url": "https://example.com/report2.pdf"},
          {"type": "input_file", "file_url": "https://example.com/chart.png"}
        ]
      }
    ],
    "max_output_tokens": 2000
  }'
```

**Streaming Response:**
```bash
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "Summarize this document"},
          {"type": "input_file", "file_url": "https://example.com/document.pdf"}
        ]
      }
    ],
    "stream": true
  }' --no-buffer
```

#### Supported Media Types

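All file types are supported via URL, local path, or base64 (local paths require `ABSTRACTCORE_SERVER_MEDIA_ROOT` or `ABSTRACTCORE_SERVER_ALLOW_LOCAL_FILES=1` on the server):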

- **Documents**: PDF, DOCX, XLSX, PPTX
- **Data Files**: CSV, TSV, JSON, XML
- **Text Files**: TXT, MD
- **Images**: PNG, JPEG, GIF, WEBP, BMP, TIFF
- **Size Limits**: 10MB per file, 32MB total per request

**Source Options:**
```json
// HTTP/HTTPS URL
{"type": "input_file", "file_url": "https://example.com/report.pdf"}

// Local file path
{"type": "input_file", "file_url": "/path/to/document.xlsx"}

// Base64 data URL
{"type": "input_file", "file_url": "data:application/pdf;base64,JVBERi0x..."}
```
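
When local paths are disabled on the server, a client can inline a file as a base64 `data:` URL instead. A minimal sketch follows; `bytes_to_input_part` is a hypothetical helper name, and falling back to `application/octet-stream` for unknown extensions is an assumption.

```python
import base64
import mimetypes
from typing import Dict


def bytes_to_input_part(data: bytes, filename: str) -> Dict[str, str]:
    """Encode file bytes as an `input_file` part carrying a base64 data: URL."""
    mime_type, _ = mimetypes.guess_type(filename)
    # Assumption: use a generic MIME type when the extension is unknown.
    mime_type = mime_type or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "input_file", "file_url": f"data:{mime_type};base64,{encoded}"}
```

For a file on disk, pass `pathlib.Path(path).read_bytes()` as `data` and the file name as `filename`, then append the returned part to the `content` list of the request.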

#### Python Client Example

```python
import os

import requests

# Direct request to the /v1/responses endpoint
response = requests.post(
    "http://localhost:8000/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['ABSTRACTCORE_AUTH_TOKEN']}"},
    json={
        "model": "gpt-4o",
        "input": [
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Analyze this document"},
                    {"type": "input_file", "file_url": "https://example.com/report.pdf"}
                ]
            }
        ]
    }
)
response.raise_for_status()

result = response.json()
# OpenAI-format requests return a Responses payload (`object: "response"`),
# so the text lives in the `output` message items, not in `choices`.
for item in result.get("output", []):
    if item.get("type") == "message":
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                print(part["text"])
```

---

### Embeddings

**Endpoint:** `POST /v1/embeddings`

Generate embedding vectors for semantic search, RAG, and similarity analysis.

**Request:**
```json
{
  "input": "Text to embed",
  "model": "huggingface/sentence-transformers/all-MiniLM-L6-v2"
}
```

**Supported Providers:**
- **HuggingFace**: Local models with ONNX acceleration
- **Ollama**: `ollama/granite-embedding:278m`, etc.
- **LMStudio**: Any loaded embedding model
- **OpenAI**: `openai/text-embedding-3-small`, `openai/text-embedding-3-large`
- **OpenRouter**: `openrouter/openai/text-embedding-3-small`, etc.
- **Portkey**: `portkey/...` with your Portkey routing configuration
- **OpenAI-compatible**: `openai-compatible/...` against configured/local `/v1/embeddings` endpoints

For endpoint-backed providers such as LM Studio, vLLM, and generic
OpenAI-compatible servers, the embedding route does not require the embedding
model to appear in a chat model catalogue before the request is sent.

**Batch Embedding:**
```bash
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["text 1", "text 2", "text 3"],
    "model": "ollama/granite-embedding:278m"
  }'
```
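
Once a batch of vectors comes back, similarity scoring needs only the standard library. The sketch below assumes the response follows OpenAI's embeddings shape (`{"data": [{"index": 0, "embedding": [...]}, ...]}`), which this section does not show; `extract_vectors` and `cosine_similarity` are hypothetical helper names.

```python
import math
from typing import Any, Dict, List, Sequence


def extract_vectors(response: Dict[str, Any]) -> List[List[float]]:
    """Return embeddings in input order from an OpenAI-style response body."""
    # Assumption: each item carries an `index` matching its position in `input`.
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```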

---

### Model Discovery

**Endpoint:** `GET /v1/models`

List all available models from configured providers.

**Query Parameters:**
- `provider`: Filter by provider (e.g., `ollama`, `openai`, `anthropic`, `lmstudio`, `openai-compatible`).
- `input_type`: Legacy broad input filter such as `text`, `image`, `audio`, or `video`.
- `output_type`: Legacy broad output filter such as `text` or `embeddings`.
- `capability_route`: Precise route-key filter. Repeat the query parameter or
  comma-separate values, for example `capability_route=input.image,output.text`
  or `capability_route=embedding.text`. Route keys use `<kind>.<modality>` and
  are normalized through the same vocabulary as capability defaults.

**Examples:**
```bash
# All models
curl http://localhost:8000/v1/models

# Ollama models only
curl 'http://localhost:8000/v1/models?provider=ollama'

# Embedding models only
curl 'http://localhost:8000/v1/models?output_type=embeddings'

# Text embedding models only, using the route filter
curl 'http://localhost:8000/v1/models?capability_route=embedding.text'

# Vision-capable input models
curl 'http://localhost:8000/v1/models?input_type=image'

# Vision models that return text, using precise route keys
curl 'http://localhost:8000/v1/models?capability_route=input.image,output.text'

# Ollama embeddings (quote the URL so the shell does not treat & as a background operator)
curl 'http://localhost:8000/v1/models?provider=ollama&output_type=embeddings'
```
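
From Python, building the query string with `urllib.parse.urlencode` avoids shell-quoting problems altogether. The sketch below repeats `capability_route` once per route key, one of the two forms the endpoint accepts; `models_url` is a hypothetical helper name.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def models_url(base_url: str, provider: Optional[str] = None,
               capability_routes: Sequence[str] = ()) -> str:
    """Build a GET /v1/models URL with properly encoded query parameters."""
    params = []
    if provider:
        params.append(("provider", provider))
    # Repeat the parameter once per route key.
    params.extend(("capability_route", route) for route in capability_routes)
    url = f"{base_url.rstrip('/')}/v1/models"
    return f"{url}?{urlencode(params)}" if params else url
```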

---

### Provider Status

**Endpoint:** `GET /providers`

List all available providers and their status.

**Response:**
```json
{
  "providers": [
    {
      "name": "ollama",
      "type": "llm",
      "model_count": 15,
      "status": "available"
    }
  ]
}
```

---

### Health Check

**Endpoint:** `GET /health`

Server health check for monitoring.

**Response:** `{"status": "healthy"}`
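
A startup script or test harness can poll this endpoint before sending traffic. The sketch below keeps the HTTP call injectable so it stays library-agnostic; `wait_until_healthy` and its parameters are hypothetical names, not part of AbstractCore.

```python
import time
from typing import Any, Callable, Dict


def wait_until_healthy(probe: Callable[[], Dict[str, Any]],
                       attempts: int = 10, delay_s: float = 1.0) -> bool:
    """Poll `probe` until it reports {"status": "healthy"} or attempts run out."""
    for attempt in range(attempts):
        try:
            if probe().get("status") == "healthy":
                return True
        except Exception:
            pass  # server not reachable yet; try again
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

With `requests` installed, `probe` could be `lambda: requests.get("http://localhost:8000/health", timeout=2).json()`.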

---

## Agentic CLI integration

AbstractCore Server is **OpenAI-compatible**. Most OpenAI-compatible CLIs/SDKs can be pointed at it by setting:

- `OPENAI_BASE_URL="http://localhost:8000/v1"` (or an equivalent flag)
- `OPENAI_API_KEY="unused"` (many clients require a non-empty key even for local servers)

### Tool calling interoperability

- The server **does not execute tools** (it always returns tool calls; your host/runtime executes them).
- It can emit tool calls either as structured `tool_calls` (OpenAI/Codex style) **or** as tagged content for clients that parse tool calls from assistant text.
- Control the output format with `agent_format` (request body, AbstractCore extension), or rely on auto-detection (user-agent + model heuristics).

Supported `agent_format` values: `auto`, `openai`, `codex`, `qwen3`, `llama3`, `gemma`, `xml`, `passthrough`.
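
Because the server only returns tool calls, the host has to run them. The sketch below shows one way to do that for structured `tool_calls`, assuming the response follows the OpenAI chat-completions shape (`choices[0].message.tool_calls`, with arguments as a JSON string). `run_tool_calls` and `registry` are hypothetical names, and tagged-content formats would need a text parser instead.

```python
import json
from typing import Any, Callable, Dict, List


def run_tool_calls(response: Dict[str, Any],
                   registry: Dict[str, Callable[..., Any]]) -> List[Dict[str, str]]:
    """Execute structured tool calls and return `tool` messages for the next turn."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        # OpenAI-style arguments arrive as a JSON-encoded string.
        arguments = json.loads(call["function"]["arguments"] or "{}")
        if name not in registry:
            content = f"error: unknown tool {name!r}"
        else:
            content = str(registry[name](**arguments))
        results.append({"role": "tool", "tool_call_id": call["id"], "content": content})
    return results
```

Append the assistant message and the returned `tool` messages to the conversation, then send the next request to continue the exchange.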

### Codex CLI (example)

```bash
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="unused"

codex --model "ollama/qwen3-coder:30b" "Write a factorial function"
```

### Forcing a format (curl)

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/qwen3:4b-instruct-2507-q4_K_M",
    "messages": [{"role": "user", "content": "Use the tool."}],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get weather by city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }
    ],
    "agent_format": "llama3"
  }'
```
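
In Python clients, validating `agent_format` before sending catches typos early. A minimal sketch, where `with_agent_format` is a hypothetical helper and the allowed set is copied from the supported values listed under "Tool calling interoperability":

```python
from typing import Any, Dict

# Values copied from the supported `agent_format` list above.
AGENT_FORMATS = {"auto", "openai", "codex", "qwen3", "llama3", "gemma", "xml", "passthrough"}


def with_agent_format(body: Dict[str, Any], agent_format: str) -> Dict[str, Any]:
    """Return a copy of a chat-completions body with a validated `agent_format`."""
    if agent_format not in AGENT_FORMATS:
        raise ValueError(f"unsupported agent_format: {agent_format!r}")
    return {**body, "agent_format": agent_format}
```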

---

## Deployment

### Docker

```dockerfile
FROM python:3.9-slim

RUN pip install "abstractcore[server]"

EXPOSE 8000

CMD ["uvicorn", "abstractcore.server.app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```

**Run:**
```bash
docker build -t abstractcore-server .
docker run -p 8000:8000 \
  -e ABSTRACTCORE_AUTH_TOKEN=$ABSTRACTCORE_AUTH_TOKEN \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  abstractcore-server
```

### Docker Compose

```yaml
version: '3.8'

services:
  abstractcore:
    image: abstractcore-server:latest
    ports:
      - "8000:8000"
    environment:
      - ABSTRACTCORE_AUTH_TOKEN=${ABSTRACTCORE_AUTH_TOKEN}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    restart: unless-stopped
```

### Production with Gunicorn

```bash
pip install gunicorn

gunicorn abstractcore.server.app:app \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 4 \
  --bind 0.0.0.0:8000
```

---

## Debug and Monitoring

### Enable Debug Mode

Debug mode provides comprehensive logging and detailed error reporting for troubleshooting API issues.

```bash
# Method 1: Using command line flag (recommended)
abstractcore serve --debug

# Method 2: Using environment variable
export ABSTRACTCORE_DEBUG=true
abstractcore serve

# Method 3: With uvicorn directly
export ABSTRACTCORE_DEBUG=true
uvicorn abstractcore.server.app:app --host 0.0.0.0 --port 8000
```

### Debug Features

**Enhanced Error Reporting:**
- **Before**: Uninformative "422 Unprocessable Entity" messages
- **After**: Detailed field validation errors with request body capture

**Example Debug Output:**
```
🔴 Request Validation Error (422) | method=POST | error_count=2 | errors=[
  {"field": "body -> model", "message": "Field required", "type": "missing"},
  {"field": "body -> messages", "message": "Field required", "type": "missing"}
] | client=127.0.0.1

📋 Request Body (Validation Error) | body={"invalid": "data"}
```

**Request/Response Tracking:**
- Full HTTP request details (method, URL, headers, client IP)
- Response status codes and processing times
- Structured JSON logging for machine processing

**Log Files:**
- `logs/abstractcore_TIMESTAMP.log` - Structured events
- `logs/YYYYMMDD-payloads.jsonl` - Full request bodies
- `logs/verbatim_TIMESTAMP.jsonl` - Complete I/O

**Useful Commands:**
```bash
# Find errors
grep '"level": "error"' logs/abstractcore_*.log

# Track token usage
cat logs/verbatim_*.jsonl | jq '.metadata.tokens | .input + .output' | \
  awk '{sum+=$1} END {print "Total:", sum}'

# Monitor specific model
grep '"model": "qwen3-coder:30b"' logs/verbatim_*.jsonl
```
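
The token-usage pipeline can also be written in Python when `jq` is not available. This sketch assumes the same record layout the `jq` filter relies on (`metadata.tokens.input` and `metadata.tokens.output`); `total_tokens` is a hypothetical helper name.

```python
import json
from typing import Iterable


def total_tokens(lines: Iterable[str]) -> int:
    """Sum input and output token counts across JSONL log lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tokens = json.loads(line).get("metadata", {}).get("tokens") or {}
        total += (tokens.get("input") or 0) + (tokens.get("output") or 0)
    return total
```

Pass it an open file handle, for example `with open(path) as f: print(total_tokens(f))`.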

## Common Patterns

### Multi-Provider Fallback

```python
import requests

providers = [
    "ollama/qwen3-coder:30b",
    "openai/gpt-4o-mini",
    "anthropic/claude-haiku-4-5"
]

def generate_with_fallback(prompt):
    for model in providers:
        try:
            response = requests.post(
                "http://localhost:8000/v1/chat/completions",
                json={"model": model, "messages": [{"role": "user", "content": prompt}]},
                timeout=30
            )
            if response.status_code == 200:
                return response.json()
        except requests.RequestException:
            continue  # network error or timeout: try the next provider
    raise RuntimeError("All providers failed")
```

### Local Model Gateway

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3-coder:30b

# Use via AbstractCore server
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/qwen3-coder:30b",
    "messages": [{"role": "user", "content": "Write a Python function"}]
  }'
```

---

## Troubleshooting

### Server Won't Start

```bash
# Check port availability
lsof -i :8000

# Use different port
uvicorn abstractcore.server.app:app --port 3000
```

### No Models Available

```bash
# Check providers
curl http://localhost:8000/providers

# Check API keys
echo $OPENAI_API_KEY

# Start Ollama
ollama serve
ollama list
```

### Authentication Errors

```bash
# Set API keys
export ABSTRACTCORE_AUTH_TOKEN="acore-server-secret"
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Restart server after setting keys
```

---

## Why AbstractCore Server?

- **Universal**: One API for all providers  
- **OpenAI Compatible**: Drop-in replacement  
- **Simple**: Clean, focused endpoints  
- **Fast**: Lightweight, high-performance  
- **Debuggable**: Comprehensive logging  
- **CLI Ready**: Codex, Gemini CLI, Crush support  
- **Production Ready**: Docker, multi-worker, health checks  

---

## Related Documentation

- **[Getting Started](getting-started.md)** - Core library quick start
- **[Architecture](architecture.md)** - System architecture including server
- **[Python API Reference](api-reference.md)** - Core library API
- **[Embeddings Guide](embeddings.md)** - Embeddings deep dive
- **[Troubleshooting](troubleshooting.md)** - Common issues and solutions

---

**AbstractCore Server** - One server, all models, any client.

---

### Inlined: `docs/endpoint.md`

# Endpoint (single-model `/v1` server)

`abstractcore-endpoint` runs a **single-model** OpenAI-compatible server.

Unlike the multi-provider gateway ([Server](server.md)), this endpoint loads **one** `provider+model` once per worker process and reuses it across requests. It’s useful when you want to host a local backend (for example HF GGUF or MLX) as a stable `/v1` endpoint.

Source: `abstractcore/endpoint/app.py` (entrypoint: `abstractcore-endpoint`).

## When to use this vs the gateway

- Use **[Server](server.md)** when you want `model="provider/model"` routing across many providers/models from one gateway process.
- Use **Endpoint** when you want a dedicated “one worker = one model” process (simpler performance characteristics; fewer per-request initialization costs).

## Install

```bash
pip install "abstractcore[server]"
```

Then install the provider extra you need:

```bash
pip install "abstractcore[mlx]"         # Apple Silicon local inference
pip install "abstractcore[huggingface]" # Transformers / torch / llama-cpp-python (heavy)
```

## Run

```bash
# CLI flags
abstractcore-endpoint --provider mlx --model mlx-community/Qwen3-4B --host 0.0.0.0 --port 8001

# Or via env vars
export ABSTRACTENDPOINT_PROVIDER=mlx
export ABSTRACTENDPOINT_MODEL=mlx-community/Qwen3-4B
export ABSTRACTENDPOINT_HOST=0.0.0.0
export ABSTRACTENDPOINT_PORT=8001
abstractcore-endpoint
```

Health check:

```bash
curl http://localhost:8001/health
```

## Use with an OpenAI-compatible client

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="unused")
resp = client.chat.completions.create(
    model="anything",  # ignored/validated in single-model mode
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

## Prompt cache control plane (optional)

If the underlying provider exposes prompt-cache controls, the endpoint also exposes a small control plane under `/acore/prompt_cache/*` (see `abstractcore/endpoint/app.py`):

- `GET /acore/prompt_cache/stats`
- `GET /acore/prompt_cache/capabilities`
- `POST /acore/prompt_cache/set`
- `POST /acore/prompt_cache/update`
- `POST /acore/prompt_cache/fork`
- `POST /acore/prompt_cache/clear`
- `POST /acore/prompt_cache/prepare_modules`

Response contract:

- `GET /acore/prompt_cache/capabilities` always returns the provider capability profile (`supported`, `operation="capabilities"`, `capabilities`).
- Other prompt-cache routes return structured payloads instead of ambiguous booleans:
  - success: `supported=true`
  - unsupported operation: `supported=false`, `code="prompt_cache_unsupported"`
  - runtime/provider failure: `supported=false`, `code="prompt_cache_error"`

The `capabilities` object is always included on prompt-cache control-plane responses so callers can branch on `mode` / `supports_*` flags without re-probing the provider.
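
A caller can branch on this contract with a small helper. The sketch below follows the three cases listed above; `prompt_cache_outcome` is a hypothetical name, and treating any unrecognised failure shape as `"error"` is an assumption.

```python
from typing import Any, Dict


def prompt_cache_outcome(payload: Dict[str, Any]) -> str:
    """Map a prompt-cache control-plane response to "ok", "unsupported", or "error"."""
    if payload.get("supported") is True:
        return "ok"
    if payload.get("code") == "prompt_cache_unsupported":
        return "unsupported"
    # Covers code="prompt_cache_error" and any unrecognised failure shape.
    return "error"
```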

`POST /acore/prompt_cache/update` also accepts optional `thinking`, which is applied before the provider appends the cached fragment so cache-prefilled prompt state stays aligned with later generation calls.

For caching concepts, see [Session Management](session.md) and [Architecture](architecture.md).
For a dedicated overview, see [Prompt Caching](prompt-caching.md).

## Memory blocs and durable provider-backed cache artifacts

`AbstractEndpoint` also exposes `/acore/blocs/*` for the one-text/file -> one bloc -> one
provider/model cache artifact flow:

- `POST /acore/blocs/upsert_text`
- `GET /acore/blocs`
- `GET /acore/blocs/record`
- `POST /acore/blocs/delete`
- `GET /acore/blocs/kv/manifest`
- `GET /acore/blocs/kv/list`
- `POST /acore/blocs/kv/ensure`
- `POST /acore/blocs/kv/load`
- `POST /acore/blocs/kv/delete`
- `POST /acore/blocs/kv/prune`

`/acore/blocs/kv/load` returns `artifact.key`, `artifact.binding_id`, and
`artifact.prompt_cache_binding`. Pass `prompt_cache_binding` to `/v1/chat/completions` when exact
request-time binding is required. `debug=true` on ensure/load returns verbose proof fields.
Delete/prune routes are safe by default: a loaded artifact returns `409` until the caller clears
the matching live key with `clear_loaded=true` or explicitly forces deletion.

The shared route shape covers MLX, HuggingFace transformers, and supported HuggingFace GGUF
exact-renderer paths. Artifact payloads are provider/model-native, not a universal KV tensor
format. Cache keys are worker-local to the endpoint process.

---

### Inlined: `docs/troubleshooting.md`

# AbstractCore Troubleshooting Guide

Complete troubleshooting guide for the AbstractCore core library and server, including common mistakes and how to avoid them.

## Table of Contents

- [Common Mistakes to Avoid](#common-mistakes-to-avoid)
- [Quick Diagnosis](#quick-diagnosis)
- [Installation Issues](#installation-issues)
- [Core Library Issues](#core-library-issues)
- [Server Issues](#server-issues)
- [Provider-Specific Issues](#provider-specific-issues)
- [Performance Issues](#performance-issues)
- [Best Practices](#best-practices)
- [Debug Techniques](#debug-techniques)

---

## Common Mistakes to Avoid

Understanding common pitfalls helps prevent issues before they occur.

### Top mistakes (fast fixes)

1. **Incorrect provider configuration**
   - *Symptom*: Authentication failures, no model response
   - *Quick Fix*: Set API keys via environment variables (or persist them with `abstractcore --set-api-key ...`)
   - See: [Authentication Errors](#issue-authentication-errors)

2. **Not handling tool calls**
   - *Symptom*: Tools not executing, streaming interruptions
   - *Quick Fix*: Use `@tool` decorator and handle tool calls properly
   - See: [Tool Calls Not Working](#issue-tool-calls-not-working)

3. **Missing provider extras**
   - *Symptom*: `ModuleNotFoundError` for providers
   - *Quick Fix*: Install provider-specific packages with `pip install "abstractcore[provider]"`
   - See: [ModuleNotFoundError](#issue-modulenotfounderror)

4. **LM Studio server not enabled**
   - *Symptom*: Connection refused, no response from LM Studio
   - *Quick Fix*: Enable "Status: Running" toggle in LM Studio GUI
   - See: [LM Studio Server Not Enabled](#issue-lm-studio-server-not-enabled)

5. **Context length too small (LM Studio/Ollama)**
   - *Symptom*: 400 Bad Request, truncated responses, errors with long inputs
   - *Quick Fix*: Set "Default Context Length" to "Model Maximum" in LM Studio
   - See: [Context Length Too Small](#issue-context-length-too-small-400-bad-request-truncated-responses)

### Common Mistake Patterns

#### Mistake: Missing or Incorrect API Keys

**You'll See:**
- `ProviderAPIError: Authentication failed`
- No response from the model
- Cryptic error messages about credentials

**Why This Happens:**
- API keys not set as environment variables
- Whitespace or copying errors in key
- Incorrect key permissions or expired credentials

**Solution:** See [Authentication Errors](#issue-authentication-errors) for complete fix.

**Prevention:**
- Use environment variables for sensitive credentials
- Store keys in `.env` files (add to `.gitignore`)
- Regularly rotate and update API keys
- Use secret management tools for production

#### Mistake: Incorrect Tool Call Handling

**You'll See:**
- Tools not executing during generation
- Partial or missing tool call results
- Streaming interruptions

**Why This Happens:**
- Not using `@tool` decorator
- Incorrect tool definition format
- Not handling tool responses

**Solution:**
```python
from abstractcore import create_llm, tool

# Use @tool decorator for automatic tool definition
@tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return f"Weather in {city}: sunny, 72°F"

llm = create_llm("openai", model="gpt-4o-mini")
response = llm.generate(
    "What's the weather in Tokyo?",
    tools=[get_weather]  # Pass decorated function directly
)
```

**Prevention:**
- Always use `@tool` decorator for automatic tool definitions
- Use type hints for all parameters
- Add clear docstrings for tool descriptions
- Handle tool execution errors gracefully
- See: [Tool Calls Not Working](#issue-tool-calls-not-working)

#### Mistake: Overlooking Error Handling

**You'll See:**
- Unhandled exceptions
- Silent failures in tool or generation calls
- Unexpected application crashes

**Why This Happens:**
- Not catching provider-specific exceptions
- Assuming 100% reliability of LLM responses
- No retry or fallback mechanisms

**Solution:**
```python
from abstractcore import create_llm
from abstractcore.exceptions import ProviderAPIError, RateLimitError

providers = [
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-haiku-4-5"),
    ("ollama", "qwen3-coder:30b")
]

def generate_with_fallback(prompt):
    for provider, model in providers:
        try:
            llm = create_llm(provider, model=model)
            return llm.generate(prompt)
        except (ProviderAPIError, RateLimitError) as e:
            print(f"Failed with {provider}: {e}")
            continue
    raise Exception("All providers failed")
```

**Prevention:**
- Always use try/except blocks
- Implement provider fallback strategies
- Log and monitor errors systematically
- Design for graceful degradation
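
Fallback across providers handles hard failures; transient ones (rate limits, timeouts) are often better served by retrying the same provider first. The sketch below is a generic exponential-backoff helper, not an AbstractCore API: `retry` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], retry_on: Tuple[Type[BaseException], ...],
          attempts: int = 3, base_delay_s: float = 0.5) -> T:
    """Call `call`, retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            # Wait base, 2x base, 4x base, ... between attempts.
            time.sleep(base_delay_s * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

It composes with the fallback pattern above, for example `retry(lambda: llm.generate(prompt), (RateLimitError,))` inside the per-provider `try` block.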

#### Mistake: Memory and Performance Bottlenecks

**You'll See:**
- High memory consumption
- Slow response times
- Out-of-memory errors during long generations

**Why This Happens:**
- Not managing token limits
- Generating overly long responses
- Inefficient streaming configurations

**Solution:**
```python
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")

# Optimize memory and performance
response = llm.generate(
    "Complex task",
    max_tokens=1000,  # Limit response length
    timeout=30,       # Set reasonable timeout
    temperature=0.7   # Control creativity/randomness
)
```

**Prevention:**
- Always set `max_tokens`
- Use streaming for long responses
- Monitor memory usage in production
- See: [Performance Issues](#performance-issues)

#### Mistake: Hardcoding Credentials

**You'll See:**
- Exposed API keys in code
- Inflexible configuration management
- Security vulnerabilities

**Why This Happens:**
- Copying example code directly
- Not understanding configuration best practices
- Lack of environment-based configuration

**Solution:**
```python
import os
from abstractcore import create_llm

# Best practice: Load from environment
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
DEFAULT_MODEL = os.getenv('DEFAULT_LLM_MODEL', 'gpt-4o-mini')

llm = create_llm(
    "openai",
    model=DEFAULT_MODEL,
    api_key=OPENAI_API_KEY
)
```

**Prevention:**
- Never hardcode API keys or sensitive data
- Use environment variables
- Implement configuration management libraries
- Follow 12-factor app configuration principles

---

## Quick Diagnosis

Run these checks first:

```bash
# Check Python version
python --version  # Should be 3.9+

# Check AbstractCore installation
pip show abstractcore

# Test core library
python -c "from abstractcore import create_llm; print('✓ Core library OK')"

# Test server (if installed)
curl http://localhost:8000/health  # Should return {"status":"healthy"}
```

---

## Installation Issues

### Issue: ModuleNotFoundError

**Symptoms:**
```
ModuleNotFoundError: No module named 'abstractcore'
ModuleNotFoundError: No module named 'openai'
```

**Solutions:**
```bash
# Install AbstractCore
pip install abstractcore

# Install hosted SDKs or a specific provider
pip install "abstractcore[remote]"
pip install "abstractcore[openai]"
pip install "abstractcore[anthropic]"
# Local OpenAI-compatible servers and gateways (Ollama, LMStudio, OpenRouter,
# Portkey, llama.cpp, ...) work with the core install.

# Turnkey local-runtime installs
pip install "abstractcore[all-apple]"    # Apple Silicon: HF/GGUF + MLX + features + server
pip install "abstractcore[all-gpu]"      # NVIDIA GPU: HF/GGUF + vLLM + features + server

# Verify installation
pip list | grep abstract
```

### Issue: Dependency Conflicts

**Symptoms:**
```
ERROR: pip's dependency resolver does not currently take into account all the packages...
```

**Solutions:**
```bash
# Create clean environment
python3 -m venv .venv
source .venv/bin/activate  # Linux/Mac
# OR
.venv\Scripts\activate  # Windows

# Fresh install
pip install --upgrade pip
pip install "abstractcore[remote,tools,media]"  # API-first app
# or: pip install "abstractcore[all-apple]"     # Apple Silicon local stack
# or: pip install "abstractcore[all-gpu]"       # NVIDIA GPU local stack

# If still failing, try one provider at a time
pip install "abstractcore[openai]"
```

---

## Core Library Issues

### Issue: Authentication Errors

**Symptoms:**
```
Error: OpenAI API key not found
Error: Authentication failed
Error: Invalid API key
```

**Solutions:**

```bash
# Check if API key is set
echo $OPENAI_API_KEY  # Should show your key
echo $ANTHROPIC_API_KEY

# Set API key
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Add to shell profile for persistence
echo 'export OPENAI_API_KEY="sk-..."' >> ~/.bashrc
source ~/.bashrc

# Verify key format
# OpenAI: starts with "sk-"
# Anthropic: starts with "sk-ant-"

# Test authentication
python -c "from abstractcore import create_llm; llm = create_llm('openai', model='gpt-4o-mini'); print(llm.generate('test').content)"
```
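The shell checks above can also be scripted without echoing secrets to the terminal. The following is a minimal sketch, assuming only the key prefixes listed above (`sk-` for OpenAI, `sk-ant-` for Anthropic); `check_api_keys` is a hypothetical helper, not part of AbstractCore, and a passing prefix check does not prove the key is valid.

```python
import os

# Expected key prefixes, taken from the notes above. This is a sanity check
# only: a correct prefix does not mean the provider will accept the key.
EXPECTED_PREFIXES = {
    "OPENAI_API_KEY": "sk-",
    "ANTHROPIC_API_KEY": "sk-ant-",
}


def check_api_keys(env=None):
    """Return {variable: status} without ever exposing the key value."""
    env = os.environ if env is None else env
    report = {}
    for name, prefix in EXPECTED_PREFIXES.items():
        value = env.get(name, "")
        if not value:
            report[name] = "missing"
        elif value != value.strip():
            report[name] = "has surrounding whitespace"
        elif not value.startswith(prefix):
            report[name] = "unexpected prefix"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_api_keys().items():
        print(f"{name}: {status}")
```

Stray whitespace from copy/paste is a common cause of "Invalid API key" errors, which is why it is reported separately.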

### Issue: Model Not Found

**Symptoms:**
```
Error: Model 'qwen3-coder:30b' not found
Error: Unsupported model
```

**Solutions:**

**For Ollama:**
```bash
# Check available models
ollama list

# Pull missing model
ollama pull qwen3-coder:30b

# Start Ollama if it is not already running
ollama serve
```

**For LMStudio:**
```bash
# Check LMStudio server
curl http://localhost:1234/v1/models

# In LMStudio GUI:
# 1. Go to "Local Server" tab
# 2. Select model from dropdown
# 3. Click "Start Server"
```

**For OpenAI/Anthropic:**
```python
# Use correct model names
llm = create_llm("openai", model="gpt-4o-mini")  # ✓ Correct
llm = create_llm("openai", model="gpt4")  # ✗ Wrong

llm = create_llm("anthropic", model="claude-haiku-4-5")  # ✓ Correct
llm = create_llm("anthropic", model="claude-3")  # ✗ Wrong
```

### Issue: Connection Errors

**Symptoms:**
```
Connection refused
Timeout error
Network error
```

**Solutions:**

**For Ollama:**
```bash
# Start Ollama service
ollama serve

# Check if running
curl http://localhost:11434/api/tags

# If using custom host
export OLLAMA_HOST="http://localhost:11434"
```

**For LMStudio:**
```bash
# Verify server is running
curl http://localhost:1234/v1/models

# Check port in LMStudio GUI (usually 1234)
```

**For Cloud Providers:**
```bash
# Test network connection
ping api.openai.com
ping api.anthropic.com

# Check proxy settings
echo $HTTP_PROXY
echo $HTTPS_PROXY

# Disable proxy if needed
unset HTTP_PROXY
unset HTTPS_PROXY
```
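`ping` can be blocked by firewalls even when HTTPS works, so a TCP probe is sometimes a more reliable check. A minimal standard-library sketch follows; `can_connect` is a hypothetical helper, and the ports are the defaults used elsewhere in this guide (adjust them if you changed yours).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Default local ports mentioned in this guide.
LOCAL_SERVICES = {
    "Ollama": ("localhost", 11434),
    "LMStudio": ("localhost", 1234),
    "AbstractCore server": ("localhost", 8000),
}

if __name__ == "__main__":
    for name, (host, port) in LOCAL_SERVICES.items():
        state = "reachable" if can_connect(host, port, timeout=1.0) else "not reachable"
        print(f"{name} ({host}:{port}): {state}")
```

The same function works for cloud endpoints, for example `can_connect("api.openai.com", 443)`. A successful connection only shows that the port is open; it says nothing about authentication.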

### Issue: Tool Calls Not Working

**Symptoms:**
- Tools not being called
- Empty tool responses
- Tool format errors

**Solutions:**

```python
from abstractcore import create_llm, tool

# Ensure @tool decorator is used
@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: sunny, 72°F"

# Use tool correctly
llm = create_llm("openai", model="gpt-4o-mini")
response = llm.generate(
    "What's the weather in Paris?",
    tools=[get_weather]  # Pass as list
)

# Check if tool was called
if hasattr(response, 'tool_calls') and response.tool_calls:
    print("Tools were called")
```

---

## Server Issues

### Issue: Server Won't Start

**Symptoms:**
```
Address already in use
Port 8000 is already allocated
```

**Solutions:**

```bash
# Check what's using port 8000
lsof -i :8000  # Linux/Mac
netstat -ano | findstr :8000  # Windows

# Kill process on port
kill -9 $(lsof -t -i:8000)  # Linux/Mac

# Use different port
uvicorn abstractcore.server.app:app --port 3000
```
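If you would rather not guess a free port, you can ask the operating system for one. This is a small sketch using only the standard library; `find_free_port` is a hypothetical helper, and another process could still claim the port between the check and the server start.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    # Print a ready-to-run command using the server entry point shown above.
    print(f"uvicorn abstractcore.server.app:app --port {port}")
```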

### Issue: Client complains about missing API key

**Symptoms:**
- Your OpenAI-compatible client/CLI refuses to run without an API key (even though your server is local).

**Solutions:**

```bash
# Most OpenAI-compatible clients accept a dummy key for local servers.
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="unused"

# Verify they're set
echo "$OPENAI_BASE_URL"
echo "$OPENAI_API_KEY"
```

### Issue: Server Running but No Response

**Symptoms:**
- curl hangs
- No response from endpoints
- Timeout errors

**Solutions:**

```bash
# Check server is actually running
curl http://localhost:8000/health

# Check server logs
tail -f logs/abstractcore_*.log

# Enable debug mode
export ABSTRACTCORE_DEBUG=true
uvicorn abstractcore.server.app:app --host 0.0.0.0 --port 8000

# Test with simple request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "test"}]}'
```

### Issue: Models Not Showing

**Symptoms:**
```
curl http://localhost:8000/v1/models returns empty list
```

**Solutions:**

```bash
# Check if providers are configured
curl http://localhost:8000/providers

# Verify provider setup:

# For Ollama
ollama list  # Should show models
ollama serve  # Start it if it is not already running (blocks; use a separate terminal)

# For OpenAI
echo $OPENAI_API_KEY  # Should be set

# For Anthropic
echo $ANTHROPIC_API_KEY  # Should be set

# For LMStudio
curl http://localhost:1234/v1/models  # Should return models
```

### Issue: Tool Calls Not Working with CLI

**Symptoms:**
- Codex/Crush/Gemini CLI not detecting tools
- Tool format errors in streaming

**Solutions:**

```bash
# AbstractCore Server controls tool-call syntax via `agent_format` (request field) or auto-detection.
# - OpenAI/Codex style: structured tool calls are returned in `tool_calls` fields.
# - Tag-based formats: tool calls are emitted as tagged content for clients that parse from assistant text.

# If you control requests (curl/custom client), force a format with `agent_format`:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/qwen3:4b-instruct-2507-q4_K_M",
    "messages": [{"role": "user", "content": "Use the tool."}],
    "tools": [{"type":"function","function":{"name":"get_weather","description":"...","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],
    "agent_format": "llama3"
  }'
```

See [Server](server.md#agentic-cli-integration) for details and supported formats.

---

## Provider-Specific Issues

### Ollama

**Issue: Ollama not responding**
```bash
# Restart Ollama
pkill ollama
ollama serve

# Check status
curl http://localhost:11434/api/tags

# List models
ollama list

# Pull model if missing
ollama pull qwen3-coder:30b
```

**Issue: Out of memory**
```bash
# Use smaller models
ollama pull gemma3:1b  # Only 1GB
ollama pull qwen3:4b-instruct-2507-q4_K_M  # 4GB

# Check system memory
free -h  # Linux
vm_stat  # macOS

# Close other applications
```

### OpenAI

**Issue: Rate limits**
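```python
# Check your rate limits
# https://platform.openai.com/account/rate-limits

# Implement backoff in code
import time
from abstractcore import create_llm
from abstractcore.exceptions import RateLimitError  # verify this import path for your installed version

llm = create_llm("openai", model="gpt-4o-mini")
try:
    response = llm.generate("prompt")
except RateLimitError:
    time.sleep(20)  # Wait before retry
    response = llm.generate("prompt")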
```
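A fixed sleep is the simplest form of backoff. When rate limits persist, exponential backoff with jitter is a common refinement. The sketch below is generic and does not depend on AbstractCore; `retry_with_backoff` and its parameters are hypothetical names chosen for this example.

```python
import random
import time


def retry_with_backoff(call, retry_on=(Exception,), max_attempts=5,
                       base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `call()` and retry on `retry_on` with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

With a provider instance this would be called as `retry_with_backoff(lambda: llm.generate("prompt"), retry_on=(RateLimitError,))`. Pass only the exception types that are worth retrying; authentication errors, for example, will not fix themselves.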

**Issue: Billing**
```bash
# Check billing dashboard
# https://platform.openai.com/account/billing

# Verify payment method is added
# Check usage limits aren't exceeded
```

### Anthropic

**Issue: API key format**
```bash
# Anthropic keys start with "sk-ant-"
echo $ANTHROPIC_API_KEY  # Should start with sk-ant-

# Get key from console
# https://console.anthropic.com/
```

### LMStudio

#### Issue: Connection refused
```bash
# Verify LMStudio server is running
# Check LMStudio GUI shows "Server running"

# Test connection
curl http://localhost:1234/v1/models

# Check port number in LMStudio (usually 1234)
```

#### Issue: LM Studio Server Not Enabled
```bash
# CRITICAL: Ensure LM Studio server is enabled in the GUI
# 1. Open LM Studio application
# 2. Look for "Status: Running" toggle switch in the interface
# 3. Make sure the toggle is switched to "ON" (green background, white handle on right)
# 4. If the toggle shows "OFF", click it to enable the server
# 5. Verify the server is running by checking the status indicator

# Test server availability
curl http://localhost:1234/v1/models

# If still failing, check LM Studio logs for any error messages
```

#### Issue: Context Length Too Small (400 Bad Request, Truncated Responses)
```bash
# Problem: LLM returns 400 Bad Request, truncated output, or errors with long inputs
# Root Cause: Insufficient context length configured for the model or server

# Solution 1: Increase Default Context Length (RECOMMENDED)
# This is the most robust way to ensure all models use maximum available context
# 1. Open LM Studio application
# 2. Go to "App Settings" → "General" tab
# 3. Find "Model Defaults" → "Default Context Length"
# 4. Set dropdown to "Model Maximum" (or highest available value like 131072)
# 5. Restart LM Studio server for changes to take effect

# Solution 2: Increase Context Length per Model (Alternative)
# This method applies context length setting to a specific model
# 1. Open LM Studio application
# 2. Go to "My Models" tab
# 3. Select the specific model you are using
# 4. Look for "Context Length" slider/input (usually under "Load" or "Context" tab)
# 5. Adjust slider to maximum value (e.g., 131072 tokens)
# 6. Reload the model for changes to take effect

# Solution 3: Set Context Length on the Ollama Side (Advanced)
# Ollama sets context length through the `num_ctx` parameter.
# Inside an interactive `ollama run <model_name>` session:
#   /set parameter num_ctx 4096
#
# Or per request, in the native Ollama API payload (often handled automatically by AbstractCore):
# {
#   "model": "your-model-name",
#   "prompt": "Your long prompt here...",
#   "options": {
#     "num_ctx": 4096
#   }
# }
#
# LM Studio applies context length when the model is loaded, so use Solution 1 or 2 there.

# Verification:
# After adjusting, test with a long prompt that previously failed
# Check server logs for any warnings or errors related to context
```

---

## Performance Issues

### Issue: Slow Responses

**Diagnosis:**
```bash
# Time a request
time python -c "from abstractcore import create_llm; llm = create_llm('ollama', model='qwen3:4b-instruct-2507-q4_K_M'); print(llm.generate('test').content)"
```

**Solutions:**

**Use Faster Models:**
```python
# Faster cloud models
llm = create_llm("openai", model="gpt-4o-mini")  # Fast
llm = create_llm("anthropic", model="claude-haiku-4-5")  # Fast

# Faster local models
llm = create_llm("ollama", model="gemma3:1b")  # Very fast
llm = create_llm("ollama", model="qwen3:4b-instruct-2507-q4_K_M")  # Balanced
```

**Enable Streaming:**
```python
# Improves perceived speed
for chunk in llm.generate("Long response", stream=True):
    print(chunk.content, end="", flush=True)
```

**Optimize Parameters:**
```python
response = llm.generate(
    "prompt",
    max_tokens=500,      # Limit length
    temperature=0.3      # Lower = more deterministic (no effect on speed)
)
```

### Issue: High Memory Usage

**Solutions:**

```bash
# Use smaller models
ollama pull gemma3:1b  # 1GB instead of 30GB

# Close other applications

# For MLX on Mac, use 4-bit quantized models (Python):
#   llm = create_llm("mlx", model="mlx-community/Llama-3.2-3B-Instruct-4bit")
```

---

## Best Practices

Follow these best practices to avoid issues:

### Configuration Management
- Use environment variables for API keys
- Never commit credentials to version control
- Use `.env` files (add to `.gitignore`)
- Implement configuration validation
- Use secret management in production

### Tool Development
- Always use `@tool` decorator
- Add type hints to all parameters
- Write clear docstrings
- Handle edge cases and errors
- Test tools independently first

### Error Handling
- Always use try/except blocks
- Implement provider fallback strategies
- Log errors systematically
- Design for graceful degradation
- Monitor error rates in production
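
A provider fallback strategy can be as simple as trying backends in order and logging each failure. The following is a minimal sketch under that assumption; `generate_with_fallback` is a hypothetical helper, and each callable stands in for something like `llm.generate` on an instance returned by `create_llm(...)`.

```python
import logging

logger = logging.getLogger("fallback_example")


def generate_with_fallback(prompt, backends):
    """Try each (name, generate_fn) pair in order; return the first success.

    `backends` is a list of (name, callable) pairs. Each callable takes the
    prompt and returns a response.
    """
    failed = []
    for name, generate_fn in backends:
        try:
            return name, generate_fn(prompt)
        except Exception as exc:  # narrow this to provider errors in real code
            logger.warning("backend %s failed: %s", name, exc)
            failed.append(name)
    raise RuntimeError(f"all backends failed: {failed}")
```

Returning the backend name alongside the response makes it easy to log which provider actually served a request, which helps when monitoring error rates.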

### Performance
- Always set `max_tokens`
- Use streaming for long responses
- Batch similar requests when possible
- Monitor memory usage
- Profile slow operations

### Security
- Validate all user inputs
- Sanitize file paths and commands
- Use least privilege principle
- Run regular security audits
- Keep dependencies updated

---

## Debug Techniques

### Enable Debug Logging

**Core Library:**
```python
import logging
logging.basicConfig(level=logging.DEBUG)

from abstractcore import create_llm
llm = create_llm("openai", model="gpt-4o-mini")
```

**Server:**
```bash
# Enable debug mode
export ABSTRACTCORE_DEBUG=true

# Start with debug logging
uvicorn abstractcore.server.app:app --log-level debug

# Monitor logs
tail -f logs/abstractcore_*.log
```

### Analyze Logs

```bash
# Find errors
grep '"level": "error"' logs/abstractcore_*.log

# Track specific request
grep "req_abc123" logs/abstractcore_*.log

# Monitor latency
cat logs/verbatim_*.jsonl | jq '.metadata.latency_ms'

# Token usage
cat logs/verbatim_*.jsonl | jq '.metadata.tokens | .input + .output' | \
  awk '{sum+=$1} END {print "Total:", sum}'
```
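If `jq` is not available, the same numbers can be computed in Python. This sketch assumes the record layout implied by the `jq` queries above (`metadata.latency_ms` and `metadata.tokens.input` / `metadata.tokens.output`); `summarize_verbatim_logs` is a hypothetical helper, and missing fields are treated as zero or skipped.

```python
import glob
import json


def summarize_verbatim_logs(lines):
    """Summarize JSONL records: record count, average latency, total tokens."""
    records, total_tokens, latencies = 0, 0, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        records += 1
        meta = json.loads(line).get("metadata", {})
        tokens = meta.get("tokens", {})
        total_tokens += tokens.get("input", 0) + tokens.get("output", 0)
        if "latency_ms" in meta:
            latencies.append(meta["latency_ms"])
    avg = sum(latencies) / len(latencies) if latencies else None
    return {"records": records, "avg_latency_ms": avg, "total_tokens": total_tokens}


if __name__ == "__main__":
    all_lines = []
    for path in glob.glob("logs/verbatim_*.jsonl"):
        with open(path, encoding="utf-8") as handle:
            all_lines.extend(handle)
    print(summarize_verbatim_logs(all_lines))
```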

### Test in Isolation

```python
# Test provider directly
from abstractcore import create_llm

try:
    llm = create_llm("openai", model="gpt-4o-mini")
    response = llm.generate("Hello")
    print(f"✓ Success: {response.content}")
except Exception as e:
    print(f"✗ Error: {e}")
```

### Collect Debug Information

```bash
# Create debug report
echo "=== System ===" > debug_report.txt
uname -a >> debug_report.txt
python --version >> debug_report.txt

echo "=== Packages ===" >> debug_report.txt
pip freeze | grep -E "abstract|openai|anthropic" >> debug_report.txt

echo "=== Environment ===" >> debug_report.txt
# Redact secret values so the report is safe to share
env | grep -E "ABSTRACT|OPENAI|ANTHROPIC|OLLAMA" | sed -E 's/(KEY|TOKEN)=.*/\1=<redacted>/' >> debug_report.txt

echo "=== Tests ===" >> debug_report.txt
python -c "from abstractcore import create_llm; print('Core library: OK')" >> debug_report.txt 2>&1
curl http://localhost:8000/health >> debug_report.txt 2>&1

cat debug_report.txt
```

---

## Common Error Messages

| Error | Meaning | Solution |
|-------|---------|----------|
| `ModuleNotFoundError` | Package not installed | `pip install abstractcore` (then add provider extras as needed) |
| `Authentication Error` | Invalid API key | Check API key environment variable |
| `Connection refused` | Service not running | Start Ollama/LMStudio/server |
| `LM Studio connection failed` | LM Studio server not enabled | Enable "Status: Running" toggle in LM Studio GUI |
| `400 Bad Request` (LM Studio) | Context length too small | Increase Default Context Length to "Model Maximum" in LM Studio |
| `Model not found` | Model unavailable | Pull model or check name |
| `Rate limit exceeded` | Too many requests | Wait or upgrade plan |
| `Timeout` | Request took too long | Use smaller model or increase timeout |
| `Out of memory` | Insufficient RAM | Use smaller model |
| `Port already in use` | Another process using port | Kill process or use different port |

---

## Getting Help

If you're still stuck:

1. **Check Documentation:**
   - [Getting Started](getting-started.md) - Core library quick start
   - [Prerequisites](prerequisites.md) - Provider setup
   - [Python API Reference](api-reference.md) - Core library API
   - [Server Guide](server.md) - Server setup
   - [Server API Reference](server.md) - REST API endpoints

2. **Enable Debug Mode:**
   ```bash
   export ABSTRACTCORE_DEBUG=true
   ```

3. **Collect Information:**
   - Error messages
   - Debug logs
   - System information
   - Steps to reproduce

4. **Community Support:**
   - GitHub Issues: [github.com/lpalbou/AbstractCore/issues](https://github.com/lpalbou/AbstractCore/issues)
   - GitHub Discussions: [github.com/lpalbou/AbstractCore/discussions](https://github.com/lpalbou/AbstractCore/discussions)

---

**Remember**: Most issues are configuration-related. Double-check environment variables, API keys, and that services are running before diving deep into debugging.

---

### Inlined: `docs/faq.md`

# FAQ

## What do I get with `pip install abstractcore`?

The default install is intentionally lightweight. It includes the core API (`create_llm`, `BasicSession`, tool definitions, structured output plumbing) and uses only small dependencies (`pydantic`, `httpx`).

Anything heavy (provider SDKs, torch/transformers, PDF parsing, embeddings models, web scraping deps, the HTTP server) is behind install extras. See [Getting Started](getting-started.md) and [Prerequisites](prerequisites.md).

## Which extra do I need for my provider?

- Hosted SDK bundle: `pip install "abstractcore[remote]"` installs OpenAI + Anthropic.
- OpenAI: `pip install "abstractcore[openai]"`
- Anthropic: `pip install "abstractcore[anthropic]"`
- OpenRouter, Portkey, Ollama, LM Studio, and generic OpenAI-compatible `/v1` endpoints: core install is enough (`pip install abstractcore`).
- HuggingFace (transformers/torch; heavy): `pip install "abstractcore[huggingface]"`
- Apple Silicon local LLM stack: `pip install "abstractcore[apple]"` (alias of `mlx`; heavy)
- GPU local LLM stack: `pip install "abstractcore[gpu]"` (alias of `vllm`; heavy)
- Explicit provider extras remain available: `abstractcore[mlx]`, `abstractcore[vllm]`

These providers work with the core install (no provider extra): `ollama`, `lmstudio`, `openrouter`, `portkey`, `openai-compatible`.

## How do I combine extras?

```bash
# zsh: keep quotes
pip install "abstractcore[remote,media,tools]"
```

For “turnkey” local-runtime installs, see `README.md` (`all-apple` for Apple Silicon, `all-gpu` for NVIDIA GPU). The `apple` and `gpu` extras install only the hardware-specific local LLM engine stack; the `all-*` extras are larger aggregate profiles that also include local capability plugin engines where supported.

## Why did my install pull `torch` / take a long time?

You probably installed a heavy extra (most commonly `abstractcore[huggingface]`, `abstractcore[apple]`/`abstractcore[mlx]`, `abstractcore[gpu]`/`abstractcore[vllm]`, or `abstractcore[all-*]`). The core install (`pip install abstractcore`) does not include torch/transformers.

## What’s the difference between “provider” and “model”?

- **Provider**: a backend adapter (`openai`, `anthropic`, `ollama`, `lmstudio`, …)
- **Model**: a provider-specific model name (for example `gpt-4o-mini` or `qwen3:4b-instruct-2507-q4_K_M`)

```python
from abstractcore import create_llm
llm = create_llm("openai", model="gpt-4o-mini")
```

## How does AbstractCore relate to AbstractFramework / AbstractRuntime?

AbstractCore is one of the core packages in the **AbstractFramework** ecosystem:

- AbstractFramework (umbrella): https://github.com/lpalbou/AbstractFramework
- AbstractCore (this package): unified LLM interface + cross-provider infrastructure
- AbstractRuntime: durable tool/effect execution, workflows, and state persistence — https://github.com/lpalbou/abstractruntime

AbstractCore is usable standalone. In the ecosystem, the common pattern is:
- AbstractCore produces `resp.content` + `resp.tool_calls`
- a runtime (for example AbstractRuntime) decides whether/how to execute tools (policy, sandboxing, retries, persistence)

See [Architecture](architecture.md) and [Tool Calling](tool-calling.md).

## How do I connect to a local server (Ollama / LMStudio / vLLM / llama.cpp / LocalAI)?

Use the matching provider and set `base_url` (or the provider’s base-url env var).
We recommend open-source/local providers first; cloud and gateway providers are optional.

Examples:

```python
from abstractcore import create_llm

llm = create_llm("ollama", model="qwen3:4b-instruct-2507-q4_K_M", base_url="http://localhost:11434")
llm = create_llm("lmstudio", model="qwen/qwen3-4b-2507", base_url="http://localhost:1234/v1")
llm = create_llm("vllm", model="Qwen/Qwen3-Coder-30B-A3B-Instruct", base_url="http://localhost:8000/v1")
```

For a generic OpenAI-compatible endpoint, use `openai-compatible`:

```python
llm = create_llm("openai-compatible", model="my-model", base_url="http://localhost:1234/v1")
```

See [Prerequisites](prerequisites.md) for setup details and env var names.

## Why do gateway providers return “unsupported parameter” errors (temperature/max_tokens)?

Gateways like Portkey and OpenRouter forward your payload to the routed backend model, and strict families (for example OpenAI reasoning models like gpt-5/o1) reject unsupported parameters.

In AbstractCore’s gateway providers:
- Portkey uses `PORTKEY_API_KEY` and `PORTKEY_CONFIG` (config id) for routing.
- Optional params (`temperature`, `top_p`, `max_output_tokens`) are only sent when you explicitly set them.
- Reasoning families (gpt-5/o1) drop `temperature`/`top_p` and use `max_completion_tokens` instead of `max_tokens`.

If you still see errors, confirm:
- You aren’t mixing routing modes (config vs virtual key vs provider-direct).
- You’re not injecting parameters via Portkey config overrides that the backend rejects.
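
The parameter handling described in this answer can be pictured as a small payload builder. This is an illustration of the rules only, not AbstractCore's actual implementation; `build_payload` and the prefix list are assumptions made for the example.

```python
# Illustrative only: prefixes used to spot "reasoning" model families.
REASONING_PREFIXES = ("gpt-5", "o1")


def build_payload(model, messages, temperature=None, top_p=None, max_tokens=None):
    """Build a chat payload that only carries explicitly set parameters."""
    payload = {"model": model, "messages": messages}
    # Strip an optional "provider/" route prefix before matching the family.
    family = model.split("/")[-1]
    is_reasoning = family.startswith(REASONING_PREFIXES)

    if not is_reasoning:
        # Sampling parameters are sent only when the caller set them.
        if temperature is not None:
            payload["temperature"] = temperature
        if top_p is not None:
            payload["top_p"] = top_p
    if max_tokens is not None:
        # Reasoning families expect a different name for the token limit.
        key = "max_completion_tokens" if is_reasoning else "max_tokens"
        payload[key] = max_tokens
    return payload
```

Because unset parameters never reach the payload, a strict backend only sees values the caller chose on purpose.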

## How do I set API keys and defaults?

You can use environment variables, or persist settings via the config CLI:

```bash
abstractcore --config
abstractcore --set-api-key openai sk-...
abstractcore --set-api-key anthropic sk-ant-...
abstractcore --set-server-auth-token acore-server-secret
abstractcore --status
```

Config is stored in `~/.abstractcore/config/abstractcore.json`. See [Centralized Config](centralized-config.md).

## Can I use the HTTP server with only provider API keys?

Yes. You do not have to give a client the AbstractCore server auth token. If `ABSTRACTCORE_AUTH_TOKEN` is not configured, a client can bring its own upstream provider key, for example an Anthropic, OpenRouter, or Portkey key, by sending it as `Authorization: Bearer <provider-key>` or `X-AbstractCore-Provider-API-Key`.

That key is forwarded only to the provider requested by the model route, such as `anthropic/...`, `openrouter/...`, or `portkey/...`. It does not unlock other server-configured provider keys, and it does not grant access to providers the client did not supply credentials for.

If `ABSTRACTCORE_AUTH_TOKEN` is configured, `Authorization` is reserved for the AbstractCore server auth token. In that mode, use `X-AbstractCore-Provider-API-Key` only when you want to override the upstream provider key for a single request. Provider keys in request bodies remain disabled; select discovery endpoints accept an `api_key` query parameter for tooling/Swagger UI convenience.
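
The two modes can be summarized as a header-building rule on the client side. This is a sketch under stated assumptions: `build_auth_headers` is a hypothetical helper, and sending the server token with the `Bearer` scheme is an assumption not spelled out above.

```python
def build_auth_headers(provider_key=None, server_token=None):
    """Return request headers for the two server modes described above.

    server_token: the server's ABSTRACTCORE_AUTH_TOKEN value, if one is configured.
    provider_key: the client's own upstream provider key, if it brings one.
    """
    headers = {"Content-Type": "application/json"}
    if server_token:
        # Authorization is reserved for the server token in this mode
        # (Bearer scheme assumed here).
        headers["Authorization"] = f"Bearer {server_token}"
        if provider_key:
            # Per-request override of the upstream provider key.
            headers["X-AbstractCore-Provider-API-Key"] = provider_key
    elif provider_key:
        # No server token configured: the provider key travels as a bearer token.
        headers["Authorization"] = f"Bearer {provider_key}"
    return headers
```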

## Why aren’t tools executed automatically?

By default, AbstractCore runs in **pass-through** mode (`execute_tools=False`): it returns tool calls in `resp.tool_calls`, and your host/runtime decides whether/how to execute them.

Automatic execution (`execute_tools=True`) exists but is deprecated for most use cases. See [Tool Calling](tool-calling.md).
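
In pass-through mode, the host needs a small dispatcher that decides which requested tools may run. The sketch below shows one way to do that with an allowlist. The tool-call shape used here (dicts with `name` and `arguments`) is an assumption for illustration; check the actual objects in `resp.tool_calls` and adapt the field access accordingly.

```python
def get_weather(city: str) -> str:
    """Example tool used for this illustration."""
    return f"Weather in {city}: sunny, 72°F"


# Allowlist of tools the host is willing to run.
TOOL_REGISTRY = {"get_weather": get_weather}


def execute_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Run each requested tool if it is allowlisted; collect results and errors."""
    results = []
    for call in tool_calls:
        # Hypothetical shape: {"name": str, "arguments": dict}
        name, args = call["name"], call.get("arguments", {})
        tool_fn = registry.get(name)
        if tool_fn is None:
            results.append({"name": name, "error": "tool not allowed"})
            continue
        try:
            results.append({"name": name, "output": tool_fn(**args)})
        except Exception as exc:  # report failures instead of crashing the host
            results.append({"name": name, "error": str(exc)})
    return results
```

Keeping execution in the host is what lets you add policy (allowlists, sandboxing, retries) without changing how the model is called.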

## What’s the difference between `web_search`, `skim_websearch`, `skim_url`, and `fetch_url`?

These built-in web tools live in `abstractcore.tools.common_tools` and require:

```bash
pip install "abstractcore[tools]"
```

- `web_search`: fuller DuckDuckGo result set (good when you want breadth or more options).
- `skim_websearch`: compact/filtered search results (good default for agents to keep prompts smaller). Defaults to 5 results and truncates long snippets.
- `skim_url`: fast URL triage (fetches only a prefix and extracts lightweight metadata + a short preview). Defaults: `max_bytes=200_000`, `max_preview_chars=1200`, `max_headings=8`.
- `fetch_url`: full fetch + parsing for text-first types (HTML→Markdown, JSON/XML/text). For PDFs/images/other binaries it returns metadata and optional previews; it does **not** do full PDF text extraction. It downloads up to 10MB by default; use `include_full_content=False` for smaller outputs.

Recommended workflow: `skim_websearch` → `skim_url` → `fetch_url` (use `include_full_content=False` when you want a smaller `fetch_url` output).

## How do I preserve tool-call markup in `response.content` for agentic CLIs?

Use tool-call syntax rewriting:

- Python: pass `tool_call_tags=...` to `generate()` / `agenerate()`
- Server: set `agent_format` in requests

See [Tool Syntax Rewriting](tool-syntax-rewriting.md).

## How do I get structured output (typed objects) instead of parsing JSON?

Pass a Pydantic model via `response_model=...`:

```python
from pydantic import BaseModel
from abstractcore import create_llm

class Answer(BaseModel):
    title: str
    bullets: list[str]

llm = create_llm("openai", model="gpt-4o-mini")
result = llm.generate("Summarize HTTP/3 in 3 bullets.", response_model=Answer)
```

See [Structured Output](structured-output.md).

## Why does structured output retry or fail validation?

Structured output is validated against your schema. If validation fails, AbstractCore retries with feedback (up to the configured retry limit). Common fixes:

- simplify schemas (fewer nested structures; fewer strict constraints)
- tighten prompts (be explicit about allowed values and ranges)
- increase timeouts for slow backends

See [Structured Output](structured-output.md) and [Troubleshooting](troubleshooting.md).
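
To make the retry-with-feedback idea concrete, here is a standard-library sketch of the general technique. It is not AbstractCore's implementation: `validate_answer` and `generate_validated` are hypothetical, the hand-written validator stands in for a Pydantic schema, and `generate_text` is any callable that maps a prompt to a string.

```python
import json


def validate_answer(data):
    """Return a list of problems; an empty list means the data is valid."""
    problems = []
    if not isinstance(data.get("title"), str):
        problems.append("title must be a string")
    bullets = data.get("bullets")
    if not (isinstance(bullets, list) and all(isinstance(b, str) for b in bullets)):
        problems.append("bullets must be a list of strings")
    return problems


def generate_validated(generate_text, prompt, max_retries=2):
    """Call generate_text(prompt) until its JSON output validates."""
    attempt_prompt = prompt
    problems = []
    for _ in range(max_retries + 1):
        raw = generate_text(attempt_prompt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict):
                problems = validate_answer(data)
            else:
                problems = ["expected a JSON object"]
        except json.JSONDecodeError as exc:
            data, problems = None, [f"invalid JSON: {exc.msg}"]
        if not problems:
            return data
        # Feed the validation errors back so the next attempt can correct them.
        feedback = "; ".join(problems)
        attempt_prompt = f"{prompt}\n\nYour previous answer was rejected: {feedback}"
    raise ValueError(f"no valid output after {max_retries + 1} attempts: {problems}")
```

The loop shows why simpler schemas help: every extra constraint is another way for an attempt to fail and trigger a retry.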

## Why do PDFs / Office docs / images not work?

Those require the media extra:

```bash
pip install "abstractcore[media]"
```

Then pass `media=[...]` to `generate()` or use the media pipeline. See [Media Handling](media-handling-system.md).

## How do I attach audio or video?

Audio and video attachments are supported via `media=[...]`, but they are **policy-driven** by design:

- **Audio** defaults to `audio_policy="auto"`: native audio when supported, otherwise the configured `input.voice` route.
- **Video** defaults to `video_policy="auto"`: native video when supported, otherwise sampled frames routed through visual support on `input.text` or an explicit `input.video` route. Frame sampling requires `ffmpeg`/`ffprobe`.

Speech-to-text fallback for audio requires an `input.voice` route for normal defaults. Direct `audio_policy="speech_to_text"` can still force explicit STT routing.

You can set defaults via the config CLI:

```bash
abstractcore --set-audio-strategy auto
abstractcore --set-video-strategy auto
abstractcore --set-video-max-frames 6
```

See:
- [Media Handling](media-handling-system.md) (policies + fallbacks)
- [Vision Capabilities](vision-capabilities.md) (image/video input + fallback behavior)

## How do I do speech-to-text (STT) or text-to-speech (TTS)?

Install the optional capability plugin package:

```bash
pip install "abstractcore[voice]"
```

This installs the remote-light AbstractVoice capability path. Local voice
engines require an explicit local profile such as `abstractcore[all-apple]` or
`abstractcore[all-gpu]`.

Then use the deterministic capability surfaces:

```python
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")  # provider/model is only for LLM calls; STT/TTS are deterministic
print(llm.capabilities.status())  # shows which capability backends are available/selected

wav_bytes = llm.voice.tts("Hello", format="wav")
text = llm.audio.transcribe("speech.wav")
```

If you run the optional HTTP server, you can also use OpenAI-compatible endpoints:
- `POST /v1/audio/transcriptions`
- `POST /v1/audio/speech`

See: [Server](server.md) and [Capabilities](capabilities.md).

## How do I generate/edit images or generate video?

Generative vision is intentionally not part of AbstractCore’s default install. Use `abstractvision`:

```bash
pip install abstractvision
```

You can use it through AbstractCore’s `llm.vision.*` capability plugin surface (`t2i`, `i2i`, `t2v`, `i2v`) or through AbstractCore Server’s optional endpoints:
- `POST /v1/images/generations`
- `POST /v1/images/edits`

Local MLX-Gen models are selected by exact repo id, for example `AbstractFramework/qwen-image-2512-4bit`, `briaai/FIBO`, `AbstractFramework/wan2.2-t2v-a14b-diffusers-8bit`, or `AbstractFramework/wan2.2-i2v-a14b-diffusers-8bit`.

See: [Server](server.md), [Capabilities](capabilities.md), and `abstractvision/docs/reference/abstractcore-integration.md` (in the AbstractVision repo).

## What are “glyphs” and what do they require?

Glyph visual-text compression is an optional feature for long documents. Install:

- `pip install "abstractcore[compression]"` (renderer)
- plus `pip install "abstractcore[media]"` if you want PDF extraction support

See [Glyph Visual-Text Compression](glyphs.md).

## How do I use embeddings?

Embeddings are opt-in:

```bash
pip install "abstractcore[embeddings]"
```

Then import from the embeddings module:

```python
from abstractcore.embeddings import EmbeddingManager
```

See [Embeddings](embeddings.md).

## Do I need the HTTP server?

No. The server is optional and is mainly for:

- exposing one OpenAI-compatible `/v1` endpoint that can route to multiple providers/models
- integrating with OpenAI-compatible clients and agentic CLIs

Install and run:

```bash
pip install "abstractcore[server]"
abstractcore serve
```

See [Server](server.md).

## Where are logs and traces?

- Logging (console/file) is configured via the config CLI and config file. See [Structured Logging](structured-logging.md).
- Interaction tracing is opt-in (`enable_tracing=True`). See [Interaction Tracing](interaction-tracing.md).

## I’m getting HTTP timeouts. What should I change?

- Per-provider: pass `timeout=...` to `create_llm(...)` (`timeout=None` means unlimited).
- Process-wide default: set `abstractcore --set-default-timeout 0` (0 = unlimited), or set a larger value.
- Some CLI apps have their own `--timeout` flags; run `--help` for the exact behavior.

See [Troubleshooting](troubleshooting.md) and [Centralized Config](centralized-config.md).

## HuggingFace won’t download models — why?

The HuggingFace provider respects AbstractCore’s offline-first settings. If you want HuggingFace to fetch from the Hub, update `~/.abstractcore/config/abstractcore.json`:

- set `"offline_first": false`
- set `"force_local_files_only": false`

Restart your Python process after changing this (the provider reads these settings at import time).

## Is AbstractCore a full agent/RAG framework?

AbstractCore focuses on provider abstraction + infrastructure (tools, structured output, media handling, tracing). It does not ship a full RAG pipeline or multi-step agent orchestration. See [Capabilities](capabilities.md).

---

### Inlined: `docs/architecture.md`

# AbstractCore Architecture

AbstractCore provides a unified interface to major LLM providers with production-oriented reliability features. This document explains how it works internally and why it's designed this way.

If you're new to AbstractCore and want to start building quickly, read:
- `docs/getting-started.md`
- `docs/api.md`

Related docs (user-facing):
- Media inputs (images/audio/video + documents): `docs/media-handling-system.md`
- Vision input + fallback: `docs/vision-capabilities.md`
- Capability plugins (voice/audio/vision/music): `docs/capabilities.md`
- OpenAI-compatible gateway server: `docs/server.md`
- Single-model OpenAI-compatible endpoint: `docs/endpoint.md`
- Tool calling semantics (passthrough vs execution): `docs/tool-calling.md`

## System Overview

AbstractCore operates as a Python library and can also be exposed via **optional OpenAI-compatible HTTP servers**:

- **Gateway server (multi-provider)**: `abstractcore.server.app` (docs: `docs/server.md`)
- **Endpoint server (single-model)**: `abstractcore.endpoint.app` (docs: `docs/endpoint.md`)

```mermaid
graph TD
    A[Your Application] --> B[AbstractCore API]
    AA[HTTP Clients] --> BB[AbstractCore Server]
    BB --> B
    
    B --> C[Provider Interface]
    C --> D[Event System]
    C --> E[Tool System]
    C --> F[Retry System]
    C --> G[Provider Implementations]

    G --> H[OpenAI Provider]
    G --> HH[OpenAI-Compatible Provider]
    G --> I[Anthropic Provider]
    G --> J[Ollama Provider]
    G --> K[MLX Provider]
    G --> L[LMStudio Provider]
    G --> M[HuggingFace Provider]
    G --> MM[vLLM Provider]
    G --> MN[OpenRouter Provider]
    G --> MP[Portkey Provider]

    H --> N[OpenAI API]
    HH --> NN[OpenAI-Compatible /v1 Endpoint]
    I --> O[Anthropic API]
    J --> P[Ollama Server]
    K --> Q[MLX Models]
    L --> R[LMStudio Server]
    M --> S[HuggingFace Models]
    MM --> RR[vLLM Server]
    MN --> RO[OpenRouter API]
    MP --> RP[Portkey API Gateway]

    style B fill:#e1f5fe
    style BB fill:#4caf50
    style C fill:#f3e5f5
    style G fill:#fff3e0
```

## Design Principles

### 1. Provider Abstraction
**Goal**: Same interface for all providers
**Implementation**: Common interface with provider-specific implementations

### 2. Production Reliability
**Goal**: Handle real-world failures gracefully
**Implementation**: Built-in retry logic, circuit breakers, comprehensive error handling

### 3. Universal Tool Support
**Goal**: Tools work everywhere, even with providers that don't support them natively
**Implementation**: Native tool APIs where available; prompted tool-call formats (parsed by AbstractCore) as a fallback

### 4. Simplicity Over Features
**Goal**: Clean, focused API that's easy to understand
**Implementation**: Minimal core with clear extension points

### 5. Optional HTTP Access
**Goal**: Flexible deployment as library or server
**Implementation**: OpenAI-compatible REST API built on core library

## Core Components

### 1. Factory Pattern (`create_llm`)

The main entry point uses the factory pattern for clean provider instantiation:

```mermaid
graph LR
    A[create_llm] --> B{Provider Type}
    B --> C[OpenAI Provider]
    B --> D[Anthropic Provider]
    B --> E[Ollama Provider]
    B --> F[Other Providers...]

    C --> G[Configured Instance]
    D --> G
    E --> G
    F --> G

    style A fill:#4caf50
    style G fill:#2196f3
```

```python
from abstractcore import create_llm

# Factory creates the right provider with proper configuration
llm = create_llm("openai", model="gpt-4o-mini", temperature=0.7)

# OpenAI-compatible /v1 endpoints (LMStudio, vLLM, custom proxies)
llm_local = create_llm("lmstudio", model="qwen/qwen3-4b-2507", base_url="http://localhost:1234/v1")
llm_openrouter = create_llm("openrouter", model="openai/gpt-4o-mini")  # requires OPENROUTER_API_KEY
llm_portkey = create_llm("portkey", model="gpt-4o-mini", config_id="pcfg_...")  # requires PORTKEY_API_KEY + PORTKEY_CONFIG
```

Gateway providers (OpenRouter/Portkey) route to external backends; AbstractCore forwards only **explicit** generation parameters to avoid sending defaults that strict backends reject.
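
The "explicit parameters only" idea can be illustrated with a minimal sketch. It is not AbstractCore's implementation: `build_payload` and its parameter list are hypothetical, and `None` is assumed to mean "the caller did not set this".

```python
from typing import Any, Dict, Optional


def build_payload(
    model: str,
    prompt: str,
    *,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a chat request body that contains only explicitly set parameters."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    explicit = {"temperature": temperature, "max_tokens": max_tokens, "seed": seed}
    # Assumption: None means "not set"; unset keys are omitted, not sent as defaults.
    payload.update({key: value for key, value in explicit.items() if value is not None})
    return payload


payload = build_payload("openai/gpt-4o-mini", "Hello", temperature=0.2)
print(sorted(payload))
```

Omitting unset keys lets the backend apply its own defaults, which matters when a strict backend rejects parameters it does not support.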

### 2. Provider Interface

All providers implement `AbstractCoreInterface` (see `abstractcore/core/interface.py`):

```python
class AbstractCoreInterface(ABC):
    @abstractmethod
    def generate(
        self,
        prompt: str,
        messages: Optional[List[Dict[str, str]]] = None,
        system_prompt: Optional[str] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        media: Optional[List[Union[str, Dict[str, Any], "MediaContent"]]] = None,
        stream: bool = False,
        thinking: Optional[Union[bool, str]] = None,
        **kwargs,
    ) -> Union[GenerateResponse, Iterator[GenerateResponse]]:
        """Generate a response (or a stream of chunks)."""

    @abstractmethod
    def get_capabilities(self) -> List[str]:
        """Get provider capabilities"""

    @abstractmethod
    def unload_model(self, model_name: str) -> None:
        """Unload/cleanup resources for a specific model (best-effort)."""
```

This ensures:
- **Consistency**: Same methods across all providers
- **Reliability**: Standardized error handling
- **Extensibility**: Easy to add new providers
- **Memory Management**: Explicit control over model lifecycle

#### Response Normalization (Model Output Cleanup)

`BaseProvider` also applies **asset-driven response normalization** so downstream code sees clean, consistent output across providers:

- **Output wrappers**: Strip configured leading/trailing wrapper tokens (e.g., GLM `<|begin_of_box|>…<|end_of_box|>`)
- **Harmony transcripts (GPT-OSS)**: Extract `<|channel|>final` into `GenerateResponse.content` and capture `<|channel|>analysis` as `GenerateResponse.metadata["reasoning"]` (non-streaming)
- **Thinking tags**: Extract inline `<think>...</think>` blocks into `GenerateResponse.metadata["reasoning"]` (when configured)

**Why this belongs in `BaseProvider` (even for streaming):**
- These artifacts are **model/template-specific**, not provider-specific (the same model can be served via Ollama, vLLM, LMStudio, HF, or MLX)
- In streaming mode, wrappers often appear in the first/last chunks; stripping them incrementally avoids leaking markup into UIs and tool parsers without buffering the full response

Configuration comes from `abstractcore/assets/architecture_formats.json` and `abstractcore/assets/model_capabilities.json`; implementation lives in `abstractcore/architectures/response_postprocessing.py`.
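
As a rough illustration of the non-streaming case, the sketch below strips one wrapper pair and moves `<think>` blocks into a metadata dict. It is a simplified stand-in, not the code in `response_postprocessing.py`; the `normalize` helper and its hard-coded wrapper pair are assumptions made for the example.

```python
import re
from typing import Dict, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def normalize(
    text: str,
    wrappers: Tuple[str, str] = ("<|begin_of_box|>", "<|end_of_box|>"),
) -> Tuple[str, Dict[str, str]]:
    """Return (clean content, metadata) for a raw model output."""
    metadata: Dict[str, str] = {}
    reasoning = [block.strip() for block in THINK_RE.findall(text)]
    if reasoning:
        # Keep the reasoning, but out of the user-visible content.
        metadata["reasoning"] = "\n".join(reasoning)
        text = THINK_RE.sub("", text)
    text = text.strip()
    start, end = wrappers
    if start and text.startswith(start):
        text = text[len(start):]
    if end and text.endswith(end):
        text = text[: -len(end)]
    return text.strip(), metadata


content, meta = normalize("<think>2+2 is 4</think><|begin_of_box|>4<|end_of_box|>")
print(content, meta)
```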

#### Model Metadata Registry (Source of Truth)
AbstractCore's model capability routing and architecture formatting are driven by two canonical JSON registries:
- `abstractcore/assets/model_capabilities.json` — model limits, tool/structured output flags, multimodal support, aliases
- `abstractcore/assets/architecture_formats.json` — message formats, tool call syntax, response wrappers, detection patterns

When a new model or architecture is released (or an existing one changes), update these files first. See `abstractcore/assets/README.md` for field requirements and update rules.

#### Memory Management

The `unload_model(model_name)` method is a **best-effort resource cleanup hook**.

- **API providers** (OpenAI, Anthropic): typically a no-op (safe to call).
- **Local / self-hosted providers**: behavior is provider-specific:
  - some can actively release memory (or request server-side eviction),
  - others can only close client connections and rely on server-side TTL/auto-eviction.
  - Examples: **Ollama** uses native `keep_alive` load/unload semantics, and
    **LM Studio** uses its native loaded-instance REST API when available.

Provider/server availability and model catalog membership are not loaded-model proof.
When Core can verify residency, providers expose `get_model_residency(...)`; otherwise
loaded state is reported as unknown/fail-closed.

In the OpenAI-compatible AbstractCore server (`abstractcore.server.app`), requests can set `unload_after` (default `false`)
to call `llm.unload_model(model)` after the request completes. For providers that can unload shared server state (e.g. Ollama),
this is disabled by default and must be explicitly enabled by the server operator.

```python
from abstractcore import create_llm

# Load model, use it, then free memory
llm = create_llm("ollama", model="large-model")
response = llm.generate("Hello")
llm.unload_model(llm.model)  # Explicitly free memory
del llm
```

This is critical for:
- Test suites that load multiple models sequentially
- Memory-constrained environments (<32GB RAM)
- Production systems serving different models sequentially
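
For sequential workloads like these, one option is to wrap the cleanup in a context manager so it runs even when generation raises. This is a sketch under assumptions: `loaded` is a hypothetical helper, and `FakeLLM` is a stand-in that only mimics the `model` attribute and `unload_model` method so the example runs without a provider.

```python
from contextlib import contextmanager
from typing import List


class FakeLLM:
    """Stand-in for a provider instance (assumption: real instances expose the same names)."""

    def __init__(self, model: str) -> None:
        self.model = model
        self.unloaded: List[str] = []

    def generate(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"

    def unload_model(self, model_name: str) -> None:
        self.unloaded.append(model_name)


@contextmanager
def loaded(llm):
    """Yield the LLM and call unload_model on exit, even if the body raises."""
    try:
        yield llm
    finally:
        llm.unload_model(llm.model)


results = []
for name in ["model-a", "model-b"]:
    with loaded(FakeLLM(name)) as llm:
        results.append(llm.generate("Hello"))
print(results)
```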

### 3. Media Handling System

AbstractCore includes a policy-driven media handling system that enables file attachments across all providers:

```mermaid
graph TD
    A[User Input: @file.pdf] --> B[MessagePreprocessor]
    B --> C[Extract Files + Clean Text]
    C --> D[AutoMediaHandler]
    D --> E{File Type Detection}
    E -->|Images| F[ImageProcessor]
    E -->|PDFs| G[PDFProcessor]
    E -->|Office| H[OfficeProcessor]
    E -->|Text/CSV| I[TextProcessor]

    F --> J[MediaContent Objects]
    G --> J
    H --> J
    I --> J

    J --> K{Provider Type}
    K -->|OpenAI| L[OpenAI Format]
    K -->|Anthropic| M[Anthropic Format]
    K -->|Local| N[Text Embedding]

    L --> O[Provider API Call]
    M --> O
    N --> O

    style D fill:#4caf50
    style J fill:#2196f3
    style O fill:#ff9800
```

#### Media System Architecture

**Core Components:**
- **MessagePreprocessor**: Parses `@filename` syntax in CLI and extracts file references
- **AutoMediaHandler**: Intelligent coordinator that selects appropriate processors
- **Specialized Processors**:
  - `ImageProcessor` (PIL-based for images)
  - `PDFProcessor` (pypdf by default for permissive PDF text/metadata extraction; optional PyMuPDF4LLM backend is explicit opt-in)
  - `OfficeProcessor` (Unstructured for DOCX/XLSX/PPTX)
  - `TextProcessor` (pandas for CSV/TSV data analysis)
- **Provider Handlers**: Format media content for each provider's API requirements

**Provider-Specific Formatting:**
```python
# Same MediaContent gets formatted differently:

# OpenAI (JSON with image_url):
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Analyze this"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
  ]
}

# Anthropic (Messages API with source):
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Analyze this"},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
  ]
}

# Local (image converted to text and inlined in the prompt):
"Analyze this\n\nImage description: A chart showing quarterly trends..."
```

**Graceful Fallback Strategy:**
1. **Specialized Processing**: pypdf for PDFs by default, Unstructured for Office documents
2. **Basic Processing**: Simple text extraction
3. **Metadata Fallback**: File information and properties
4. **Degrades gracefully for documents**: PDFs/Office/text aim to return best-effort extracted text/metadata rather than crashing.
5. **Policy-driven for true multimodal inputs**: for image/audio/video message parts, behavior is policy-driven; unsupported requests fail loudly unless an explicit enrichment fallback is configured (see `docs/media-handling-system.md` and `docs/vision-capabilities.md`).
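
The document-side fallback order (specialized, then basic, then metadata) can be sketched as a chain of extractors tried in turn. The function names below are hypothetical and do not correspond to the real processors; the "specialized" step simply simulates a missing optional dependency.

```python
from pathlib import Path
from typing import Callable, Dict, List

Extractor = Callable[[Path], Dict[str, str]]


def specialized(path: Path) -> Dict[str, str]:
    # Stand-in for a rich extractor whose optional backend is not installed.
    raise ImportError("specialized backend not available")


def basic_text(path: Path) -> Dict[str, str]:
    # Plain decoding; raises if the file is missing or not valid UTF-8.
    return {"level": "basic", "text": path.read_text(encoding="utf-8")}


def metadata_only(path: Path) -> Dict[str, str]:
    # Last resort: describe the file without reading it.
    return {"level": "metadata", "text": f"name={path.name} suffix={path.suffix}"}


def extract(path: Path, chain: List[Extractor]) -> Dict[str, str]:
    """Return the first result from the chain; fail only if every step fails."""
    for step in chain:
        try:
            return step(path)
        except Exception:
            continue
    raise RuntimeError(f"no processor could handle {path}")


result = extract(Path("missing_report.pdf"), [specialized, basic_text, metadata_only])
print(result)
```

This pattern suits documents only; for image/audio/video parts the text above describes failing loudly instead of silently degrading.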

#### Unified Media API

The same `media=[]` parameter works across all providers:

```python
from abstractcore import create_llm

# Universal API - works with any provider
llm = create_llm("openai", model="gpt-4o")  # or "anthropic", "ollama", etc.
response = llm.generate(
    "Analyze these files",
    media=["report.pdf", "chart.png", "data.xlsx"]
)
```

**CLI Integration:**
```bash
# Simple @filename syntax works everywhere
python -m abstractcore.utils.cli --prompt "What's in @document.pdf and @image.jpg"
```

#### Capability plugins (voice/audio/vision/music)
To keep the default `abstractcore` install dependency-light while still enabling deterministic modality APIs, AbstractCore supports optional **capability plugins**:
- `abstractvoice` provides `core.voice` + `core.audio` (TTS/STT).
- `abstractvision` provides `core.vision` (T2I/I2I/T2V/I2V; backend-pluggable).
- `abstractmusic` can provide `core.music` (text-to-music) when installed; `abstractmusic>=0.1.13` includes the remote ACE Music backend through `ACEMUSIC_API_KEY`.

Discovery:
- `llm.capabilities.status()` returns a JSON-safe snapshot (which backends are available/selected, plus install hints).
- `llm.capabilities.list_backend_infos()` returns registered backend metadata without instantiating backend factories.
- `llm.capabilities.available_providers(capability, task=...)` and
  `llm.capabilities.list_models(capability, task=..., provider=...)` expose normalized plugin
  discovery where supported.
- Convenience facades exist as properties: `llm.voice`, `llm.audio`, `llm.vision`, and `llm.music` (lazy; missing plugins raise actionable errors).

### 4. Request Lifecycle

```mermaid
sequenceDiagram
    participant App as Your App
    participant Core as AbstractCore
    participant Events as Event System
    participant Retry as Retry Logic
    participant Provider as LLM Provider
    participant Tools as Tool System

    App->>Core: generate("prompt", tools=tools)
    Core->>Events: emit(GENERATION_STARTED)
    Core->>Retry: wrap_with_retry()

    alt Provider Call Success
        Retry->>Provider: API call
        Provider->>Retry: response
        Retry->>Core: successful response
    else Provider Call Fails
        Retry->>Provider: API call (attempt 1)
        Provider->>Retry: rate limit error
        Retry->>Retry: wait with backoff
        Retry->>Provider: API call (attempt 2)
        Provider->>Retry: success
        Retry->>Core: successful response
    end

    alt Has Tool Calls
        Core->>Events: emit(TOOL_STARTED)
        Core->>Tools: execute_tools()
        Tools->>Core: tool results
        Core->>Events: emit(TOOL_COMPLETED)
    end

    Core->>Events: emit(GENERATION_COMPLETED)
    Core->>App: GenerateResponse
```

Note: in the Python API, `execute_tools` defaults to `False` (**pass-through**). Tool calls are returned in `GenerateResponse.tool_calls` for your host/runtime to execute. `execute_tools=True` exists for simple demos but is deprecated for most production use cases. The optional HTTP gateway server runs in pass-through mode.
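
In pass-through mode the host owns execution. A minimal host-side dispatcher might look like the sketch below. The tool-call shape (dicts with `name` and `arguments` keys) is an assumption for illustration; check the actual structure of `GenerateResponse.tool_calls` before relying on it.

```python
from typing import Any, Callable, Dict, List


def get_weather(city: str) -> str:
    return f"Sunny in {city}"


# Hypothetical registry mapping tool names to host-side callables.
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Execute each call the host knows about; report errors instead of raising."""
    results: List[Dict[str, Any]] = []
    for call in tool_calls:
        name = call.get("name")
        fn = TOOLS.get(name)
        if fn is None:
            results.append({"name": name, "error": "unknown tool"})
            continue
        try:
            results.append({"name": name, "output": fn(**call.get("arguments", {}))})
        except Exception as exc:
            results.append({"name": name, "error": str(exc)})
    return results


tool_calls = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "delete_everything", "arguments": {}},
]
results = run_tool_calls(tool_calls)
print(results)
```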

### 5. Tool System Architecture

The tool system provides universal tool-call detection (and optional local execution) across all providers:

```mermaid
graph TD
    A[LLM Response] --> B{Has Tool Calls?}
    B -->|No| C[Return Response]
    B -->|Yes| D[Parse Tool Calls]
    D --> E[Event: TOOL_STARTED]
    E --> F{Event Prevented?}
    F -->|Yes| G[Skip Tool Execution]
    F -->|No| H[Execute Tools]
    H --> I[Collect Results]
    I --> J[Event: TOOL_COMPLETED]
    J --> K[Append Results to Response]
    K --> C

    style D fill:#ffeb3b
    style H fill:#4caf50
    style E fill:#ff9800
```

#### Tool Execution Flow

1. **Tool Detection**: Parse tool calls from LLM response
2. **Event Emission**: Emit `TOOL_STARTED` (preventable)
3. **Optional local execution (deprecated)**: execute tools inside AbstractCore when `execute_tools=True` (providers never execute arbitrary local tools)
4. **Result Collection**: Gather results and error information
5. **Event Emission**: Emit `TOOL_COMPLETED` with results
6. **Response Integration**: Append tool results to original response

#### Provider-Specific Tool Handling with Tag Rewriting

```mermaid
graph LR
    A[Tool Definition] --> B{Provider Type}
    B --> C[OpenAI: Native JSON]
    B --> D[Anthropic: Native tool_use blocks]
    B --> E[Ollama: Architecture-specific]
    B --> F[Others: Prompted Format]

    C --> G[LLM Generation]
    D --> G
    E --> G
    F --> G

    G --> H[Tool Call Tag Rewriter]
    H --> I[Target Format Conversion]
    I --> J[Universal Tool Parser]
    J --> K[Local Tool Execution]

    style A fill:#e1f5fe
    style H fill:#ff9800
    style I fill:#9c27b0
    style K fill:#4caf50
```

#### Tool Call Tag Rewriting System

AbstractCore includes a tag rewriting system that converts tool-call markup into the format a given agentic CLI expects:

**Rewriting Pipeline**:

```mermaid
graph TD
    A[Raw LLM Response] --> B[Pattern Detection]
    B --> C{Tag Format Needed?}
    C -->|No| D[Default Qwen3 Format]
    C -->|Yes| E[Target Format Conversion]

    E --> F{Format Type}
    F -->|Predefined| G[llama3, xml, gemma, etc.]
    F -->|Custom| H[User-defined Tags]

    G --> I[Rewritten Tool Call]
    H --> I
    D --> I

    I --> J[Tool Execution]

    style B fill:#2196f3
    style E fill:#ff9800
    style I fill:#4caf50
```

**Supported Formats**:
- **Default (Qwen3)**: `<|tool_call|>...JSON...</|tool_call|>` - Compatible with Codex CLI
- **LLaMA3**: `<function_call>...JSON...</function_call>` - Compatible with Crush CLI
- **XML**: `<tool_call>...JSON...</tool_call>` - Compatible with Gemini CLI
- **Gemma**: ```` ```tool_code...JSON...``` ```` - Compatible with Gemma models
- **Custom**: Any user-defined format (e.g., `[TOOL]...JSON...[/TOOL]`)

**Real-Time Integration**:
- **Streaming Compatible**: Works seamlessly with unified streaming architecture
- **Low overhead**: Rewrites tags incrementally as chunks arrive rather than buffering the full response
- **Universal Detection**: Automatically detects source format from any model
- **Graceful Fallback**: Returns original content if rewriting fails
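
To make the conversion concrete, here is a small non-streaming sketch that rewrites the default `<|tool_call|>` markup into two of the formats listed above. It is an illustration only: `rewrite_tags` and the `TARGETS` table are hypothetical, and the real rewriter also handles streaming and source-format detection.

```python
import re

# Source format: the default tags listed above.
SOURCE_RE = re.compile(r"<\|tool_call\|>(.*?)</\|tool_call\|>", re.DOTALL)

# Target formats taken from the list above (subset for brevity).
TARGETS = {
    "llama3": ("<function_call>", "</function_call>"),
    "xml": ("<tool_call>", "</tool_call>"),
}


def rewrite_tags(text: str, target: str) -> str:
    """Rewrite tool-call tags; return the text unchanged for unknown targets."""
    if target not in TARGETS:
        return text  # graceful fallback: leave the original content alone
    open_tag, close_tag = TARGETS[target]
    return SOURCE_RE.sub(lambda m: f"{open_tag}{m.group(1)}{close_tag}", text)


raw = 'Sure. <|tool_call|>{"name": "ls", "arguments": {}}</|tool_call|>'
rewritten = rewrite_tags(raw, "llama3")
print(rewritten)
```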

### 6. Retry and Reliability System

Production-grade error handling with multiple layers:

```mermaid
graph TD
    A[LLM Request] --> B[Retry Manager]
    B --> C{Error Type}
    C -->|Rate Limit| D[Exponential Backoff]
    C -->|Network Error| D
    C -->|Timeout| D
    C -->|Auth Error| E[Fail Fast]
    C -->|Invalid Request| E

    D --> F{Max Attempts?}
    F -->|No| G[Wait + Jitter]
    G --> H[Retry Request]
    H --> B
    F -->|Yes| I[Circuit Breaker]

    I --> J{Failure Threshold?}
    J -->|No| K[Return Error]
    J -->|Yes| L[Open Circuit]
    L --> M[Fail Fast for Duration]

    style D fill:#ff9800
    style I fill:#f44336
    style L fill:#d32f2f
```

#### Retry Configuration

```python
from abstractcore import create_llm
from abstractcore.core.retry import RetryConfig

config = RetryConfig(
    max_attempts=3,           # Try up to 3 times
    initial_delay=1.0,        # Start with 1 second delay
    max_delay=60.0,           # Cap at 1 minute
    use_jitter=True,          # Add randomness
    failure_threshold=5,      # Circuit breaker after 5 failures
    recovery_timeout=60.0     # Test recovery after 1 minute
)

llm = create_llm("openai", model="gpt-4o-mini", retry_config=config)
```
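
To show how these settings interact, the sketch below computes a backoff schedule and models a simple failure counter. The doubling formula, the jitter range, and the reset-on-success rule are assumptions for illustration; they are not taken from `abstractcore.core.retry`.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_attempts: int,
    initial_delay: float,
    max_delay: float,
    use_jitter: bool = False,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delays slept between attempts: initial_delay * 2**n, capped at max_delay."""
    rng = rng or random.Random()
    delays: List[float] = []
    for n in range(max_attempts - 1):  # no sleep after the final attempt
        delay = min(max_delay, initial_delay * (2 ** n))
        if use_jitter:
            delay *= rng.uniform(0.5, 1.0)  # assumed jitter range
        delays.append(delay)
    return delays


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures (recovery timing omitted)."""

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1


delays = backoff_delays(max_attempts=3, initial_delay=1.0, max_delay=60.0)
breaker = CircuitBreaker(failure_threshold=5)
for _ in range(5):
    breaker.record(success=False)
print(delays, breaker.is_open)
```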

### 7. Event System

Observability hooks through events:

```mermaid
graph TD
    A[LLM Operation] --> B[Event Emission]
    B --> C[Global Event Bus]
    C --> D[Event Listeners]

    D --> E[Monitoring]
    D --> F[Logging]
    D --> G[Cost Tracking]
    D --> H[Tool Control]
    D --> I[Custom Logic]

    E --> J[Metrics Dashboard]
    F --> K[Log Files]
    G --> L[Cost Alerts]
    H --> M[Security Gates]
    I --> N[Business Logic]

    style B fill:#9c27b0
    style C fill:#673ab7
    style H fill:#f44336
```

#### Event Types and Use Cases

```python
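import logging

from abstractcore.events import EventType, on_global

# Stand-ins for your own alerting/logging hooks so the handlers below run as-is.
logger = logging.getLogger("abstractcore.examples")
alert = logger.warning
log = logger.info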

# Cost monitoring (best-effort estimate; based on token usage)
def monitor_costs(event):
    if event.type != EventType.GENERATION_COMPLETED:
        return
    cost = event.data.get("cost_usd")
    if isinstance(cost, (int, float)) and cost > 0.10:
        alert(f"High estimated cost: ${cost:.2f}")

# Tool monitoring
def log_tools(event):
    if event.type == EventType.TOOL_COMPLETED:
        log(f"Tool completed: {event.data.get('tool_name')}")

# Performance tracking
def track_performance(event):
    if event.type != EventType.GENERATION_COMPLETED:
        return
    duration_ms = event.data.get("duration_ms")
    if isinstance(duration_ms, (int, float)) and duration_ms > 10_000:
        log(f"Slow request: {float(duration_ms):.0f}ms")

on_global(EventType.GENERATION_COMPLETED, monitor_costs)
on_global(EventType.TOOL_COMPLETED, log_tools)
on_global(EventType.GENERATION_COMPLETED, track_performance)
```

### 8. Structured Output System with Streaming Integration

Type-safe responses with automatic validation, retry, and unified streaming:

```mermaid
graph TD
    A[LLM Generate] --> B{Streaming Mode?}
    B -->|Yes| C[Unified Streaming Processor]
    B -->|No| D[Standard JSON Parsing]

    C --> E[Incremental Tool Detector]
    E --> F[Real-time Chunk Processing]
    F --> G[Tool Call Detection]
    G --> H[Mid-Stream Tool Execution]

    D --> I[Parse JSON]
    I --> J{Valid JSON?}
    J -->|No| K[Retry with Error Feedback]
    J -->|Yes| L[Pydantic Validation]

    L --> M{Valid Model?}
    M -->|No| K
    M -->|Yes| N[Return Typed Object]

    K --> O{Max Retries?}
    O -->|No| A
    O -->|Yes| P[Raise ValidationError]

    style C fill:#4caf50
    style E fill:#2196f3
    style F fill:#ff9800
    style G fill:#9c27b0
    style K fill:#f44336
```

#### Unified Streaming Architecture

AbstractCore’s streaming system provides character-by-character streaming with incremental tool detection and optional tool-call syntax rewriting.

**Architecture Components**:

```mermaid
graph TD
    A[Stream Input] --> B[UnifiedStreamProcessor]
    B --> C[IncrementalToolDetector]
    C --> D[Tag Rewriter]
    D --> E["Tool Execution (optional)"]
    E --> F[Stream Output]

    B --> G[Character-by-Character Handling]
    G --> H[Intelligent Buffering]
    H --> C

    style B fill:#4caf50
    style C fill:#2196f3
    style D fill:#ff9800
    style E fill:#9c27b0
```

**Key Features**:

1. **Unified Streaming Strategy**
   - Single consistent approach across all providers
   - Best-effort time-to-first-token (TTFT) telemetry for debugging
   - Minimal buffering (incremental parsing)

2. **Incremental Tool Detection**
   - Real-time tool call detection during streaming
   - Emits `chunk.tool_calls` as soon as a full tool call is detected
   - Handles partial tool calls across chunk boundaries

3. **Character-by-Character Streaming**
   - Handles micro-chunking from providers (very small deltas)
   - Intelligent buffering for partial tool calls
   - Robust parsing with auto-repair for malformed JSON

4. **Tool Call Tag Rewriting Integration**
   - Real-time format conversion during streaming
   - Support for multiple formats (Qwen3, LLaMA3, Gemma, XML, custom)
   - Designed to avoid large buffering while keeping tool calls structured

**Streaming with Tag Rewriting Example**:
```python
from abstractcore import create_llm, tool

@tool
def analyze_code(code: str) -> str:
    """Return a small, deterministic analysis."""
    return f"chars={len(code)}"

llm = create_llm("ollama", model="qwen3:4b-instruct")  # requires Ollama running (default: http://localhost:11434)
for chunk in llm.generate(
    "Write a Python function, then call analyze_code on it.",
    stream=True,
    tools=[analyze_code],
    tool_call_tags="llama3",  # Emit <function_call>...</function_call> style tags
):
    print(chunk.content or "", end="", flush=True)
    if chunk.tool_calls:
        print(f"\nTool calls: {chunk.tool_calls}")

# Output format: <function_call>{"name": "analyze_code"}...</function_call>
```
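
The incremental detection described above can be sketched as a small buffer that releases display text as early as possible and holds back only what might belong to a tool call. This is a simplified illustration, not the code in `abstractcore/providers/streaming.py`: `IncrementalDetector` is hypothetical, handles a single tag format, and does no JSON repair.

```python
import json
from typing import Any, Dict, List, Tuple

OPEN, CLOSE = "<|tool_call|>", "</|tool_call|>"


class IncrementalDetector:
    def __init__(self) -> None:
        self.buffer = ""

    def feed(self, chunk: str) -> Tuple[str, List[Dict[str, Any]]]:
        """Return (text safe to display, completed tool calls) for this chunk."""
        self.buffer += chunk
        text, calls = "", []
        while True:
            start = self.buffer.find(OPEN)
            if start == -1:
                # Hold back a tail that could be the start of an opening tag.
                cut = len(self.buffer) - self._partial_tag_len(self.buffer)
                text += self.buffer[:cut]
                self.buffer = self.buffer[cut:]
                return text, calls
            end = self.buffer.find(CLOSE, start)
            if end == -1:
                # Tool call still incomplete: release the text before it only.
                text += self.buffer[:start]
                self.buffer = self.buffer[start:]
                return text, calls
            calls.append(json.loads(self.buffer[start + len(OPEN):end]))
            text += self.buffer[:start]
            self.buffer = self.buffer[end + len(CLOSE):]

    @staticmethod
    def _partial_tag_len(s: str) -> int:
        """Length of the longest proper prefix of OPEN that s ends with."""
        for n in range(min(len(OPEN) - 1, len(s)), 0, -1):
            if s.endswith(OPEN[:n]):
                return n
        return 0


detector = IncrementalDetector()
shown, found = "", []
chunks = ["Listing files <|tool", '_call|>{"name": "ls", ', '"arguments": {}}</|tool_call|> done']
for chunk in chunks:
    text, calls = detector.feed(chunk)
    shown += text
    found.extend(calls)
print(repr(shown), found)
```

Note that the opening tag is split across the first two chunks; the detector withholds the `<|tool` tail until it can tell whether it is markup or ordinary text.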

Implementation pointers (source of truth):
- Unified streaming + tool detection: `abstractcore/providers/streaming.py`
- Streaming wrapper + TTFT metadata: `abstractcore/providers/base.py`

#### Automatic Error Feedback

When validation fails, AbstractCore provides detailed feedback to the LLM:

```python
# If LLM returns invalid data, AbstractCore automatically retries with:
"""
IMPORTANT: Your previous response had validation errors:
• Field 'age': Age must be positive (got -25)
• Field 'email': Invalid email format

Please correct these errors and provide valid JSON.
"""
```
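
The retry-with-feedback loop can be sketched without any provider. This example swaps Pydantic for a hand-written validator to stay dependency-free, and `generate_validated`, `validate_user`, and `fake_llm` are hypothetical names; only the feedback wording follows the message shown above.

```python
import json
from typing import Any, Callable, Dict, List


def validate_user(data: Any) -> List[str]:
    """Return a list of validation errors (empty when the data is valid)."""
    if not isinstance(data, dict):
        return ["Top-level JSON value must be an object"]
    errors = []
    if not isinstance(data.get("age"), int) or data["age"] <= 0:
        errors.append(f"Field 'age': Age must be positive (got {data.get('age')})")
    if "@" not in str(data.get("email", "")):
        errors.append("Field 'email': Invalid email format")
    return errors


def generate_validated(llm: Callable[[str], str], prompt: str, max_retries: int = 3) -> Dict:
    current = prompt
    for _ in range(max_retries):
        raw = llm(current)
        try:
            data = json.loads(raw)
            errors = validate_user(data)
        except json.JSONDecodeError as exc:
            errors = [f"Invalid JSON: {exc}"]
        if not errors:
            return data
        # Feed the concrete errors back so the next attempt can correct them.
        feedback = "\n".join(f"• {error}" for error in errors)
        current = (
            f"{prompt}\n\nIMPORTANT: Your previous response had validation errors:\n"
            f"{feedback}\n\nPlease correct these errors and provide valid JSON."
        )
    raise ValueError(f"validation failed after {max_retries} attempts")


replies = iter(['{"age": -25, "email": "nope"}', '{"age": 25, "email": "a@b.co"}'])
prompts_seen: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


user = generate_validated(fake_llm, "Return a user as JSON.")
print(user)
```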

### 9. Session Management

`BasicSession` keeps conversation history in memory, with optional JSON persistence:

```mermaid
graph LR
    A[BasicSession] --> B[Message History]
    A --> C[System Prompt]
    A --> D[Provider Reference]

    B --> E["generate()"]
    C --> E
    D --> E

    E --> F[Add to History]
    F --> G[Return Response]

    A --> H["save()/load()"]
    H --> I[JSON Persistence]

    style A fill:#2196f3
    style B fill:#4caf50
```
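
The diagram's moving parts (history, system prompt, a generate call, JSON persistence) fit in a few lines. The class below is a toy model for illustration and is not `BasicSession`: `MiniSession`, its constructor, and its file format are all assumptions.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


class MiniSession:
    """Toy session: message history plus JSON save/load."""

    def __init__(self, generate: Callable[[List[Message]], str], system_prompt: Optional[str] = None) -> None:
        self._generate = generate
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def generate(self, prompt: str) -> str:
        # Record the user turn, call the model with the full history, record the reply.
        self.messages.append({"role": "user", "content": prompt})
        reply = self._generate(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.messages, fh)

    @classmethod
    def load(cls, path: str, generate: Callable[[List[Message]], str]) -> "MiniSession":
        session = cls(generate)
        with open(path, encoding="utf-8") as fh:
            session.messages = json.load(fh)
        return session


def echo(messages: List[Message]) -> str:
    return f"seen {len(messages)} messages"


session = MiniSession(echo, system_prompt="Be brief.")
first = session.generate("Hi")
path = os.path.join(tempfile.mkdtemp(), "session.json")
session.save(path)
restored = MiniSession.load(path, echo)
print(first, len(restored.messages))
```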

### 10. Server Architecture (Optional Component)

The AbstractCore server provides OpenAI-compatible HTTP endpoints built on top of the core library:

```mermaid
graph TD
    A[HTTP Client] --> B[FastAPI Server]
    B --> C{Endpoint Router}

    C --> D["/v1/chat/completions"]
    C --> E["/v1/embeddings"]
    C --> F["/v1/models"]
    C --> G["/providers"]
    C --> Img["/v1/images/* (optional)"]
    C --> Aud["/v1/audio/* (optional)"]
    C --> Cache["/acore/prompt_cache/*"]
    
    D --> H[Request Validation]
    E --> H
    F --> I[Provider Discovery]
    G --> I
    
    H --> J[AbstractCore Library]
    I --> J
    
    J --> K[Provider Interface]
    K --> L[LLM Providers]
    
    style B fill:#4caf50
    style J fill:#e1f5fe
    style K fill:#f3e5f5
```

**Architecture Layers**:

1. **HTTP Layer**: FastAPI-based REST API with request validation
2. **Translation Layer**: Converts HTTP requests to AbstractCore library calls
3. **Core Layer**: Uses the full AbstractCore provider system
4. **Response Layer**: Transforms responses to OpenAI-compatible format

**Key Capabilities**:

- **OpenAI Compatibility**: Works with existing OpenAI API clients; point the client's base URL at the server
- **Universal Provider Access**: Single API for all providers (OpenAI, Anthropic, Ollama, etc.)
- **Format Conversion**: Automatic tool call format conversion for agentic CLIs
- **Streaming Support**: Server-sent events for real-time responses
- **Model Discovery**: Dynamic model listing across all providers
- **Embedding Support**: Multi-provider embedding generation (HuggingFace, Ollama, LMStudio)
- **Optional Vision Endpoints**: OpenAI-compatible `/v1/images/generations` and `/v1/images/edits` (plus `/v1/vision/*` catalog/residency control plane for image/video-capable backends) delegated to `abstractvision` (safe-by-default; requires explicit config).
- **Optional Audio Endpoints**: OpenAI-compatible `/v1/audio/transcriptions` and `/v1/audio/speech` delegated to capability plugins (typically `abstractvoice`).
- **Prompt Cache Control Plane**: `/acore/prompt_cache/*` proxy endpoints for cache stats/set/update/fork/clear (best-effort; typically targets an `abstractcore.endpoint` upstream).

**Request Flow Example**:

```mermaid
sequenceDiagram
    participant Client
    participant Server as FastAPI Server
    participant Core as AbstractCore
    participant Provider as LLM Provider
    
    Client->>Server: POST /v1/chat/completions
    Server->>Server: Validate Request
    Server->>Core: create_llm(provider, model)
    Server->>Core: llm.generate(messages, tools)
    Core->>Provider: API call with retry logic
    Provider->>Core: Response
    Core->>Core: Parse tool calls (pass-through)
    Core->>Server: GenerateResponse
    Server->>Server: Convert to OpenAI format
    Server->>Client: HTTP Response (streaming or complete)
```

**Server Features**:

- **Automatic Retry**: Built-in retry logic from core library
- **Event System**: Full observability through events
- **Debug Logging**: Comprehensive request/response logging
- **Health Checks**: `/health` endpoint for monitoring
- **Interactive Docs**: Auto-generated Swagger UI at `/docs`
- **Multi-Worker Support**: Production deployment with multiple workers

## Architecture Benefits

### 1. Provider Agnostic
- **Same code works everywhere**: Switch providers by changing one line
- **No vendor lock-in**: Easy migration between cloud and local providers
- **Consistent semantics**: tools, streaming, and structured output follow the same API surface (provider/model differences still apply)

### 2. Production Ready
- **Automatic reliability**: Built-in retry logic and circuit breakers
- **Comprehensive observability**: Events for every operation
- **Error handling**: Proper error classification and handling

### 3. Extensible
- **Event system**: Hook into any operation
- **Tool system**: Add new tools easily
- **Provider system**: Add new providers with minimal code

### 4. Performance Optimized
- **Lazy loading**: Providers loaded only when needed
- **Connection pooling**: Reuse HTTP connections
- **Efficient parsing**: Optimized JSON and tool parsing

## Extension Points

AbstractCore is designed to be extended:

### Adding a New Provider

```python
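from typing import List

from abstractcore.core.types import GenerateResponse  # import path assumed; adjust to your installed version
from abstractcore.providers.base import BaseProvider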

class MyProvider(BaseProvider):
    def generate(self, prompt: str, **kwargs) -> GenerateResponse:
        # Implement provider-specific logic
        return GenerateResponse(content="...")

    def get_capabilities(self) -> List[str]:
        return ["text_generation", "streaming"]
```

### Adding Tools

```python
from abstractcore import tool

@tool
def my_custom_tool(param: str) -> str:
    """Custom tool that does something useful."""
    return f"Processed: {param}"
```

## Performance Characteristics

AbstractCore’s overhead is usually small compared to model inference and network latency. If performance matters, benchmark on your target provider/model/hardware.

Common levers:
- Provider choice and base URL latency
- Concurrency (async + connection pooling)
- Streaming vs non-streaming
- Structured output (schema size, retry behavior)
- Tool execution strategy (pass-through vs host execution)

## Security Considerations

### 1. Tool Execution Safety
- **Local execution (optional)**: providers never execute your tools. By default, tool calls are returned for your host/runtime to execute; execution inside AbstractCore happens only when `execute_tools=True`
- **Event prevention**: Stop dangerous tools before execution
- **Input validation**: Validate tool parameters
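
A host-side gate combining the last two points might look like the sketch below: an allowlist plus basic argument checks applied before any tool runs. Everything here is illustrative; the tool names, the `path` argument, and the tool-call dict shape are assumptions, and real policies usually need more than this.

```python
from typing import Any, Dict, Set

# Hypothetical allowlist: anything not named here is refused.
ALLOWED_TOOLS: Set[str] = {"read_file", "list_files"}


def check_tool_call(call: Dict[str, Any]) -> None:
    """Raise PermissionError or ValueError if the call must not run."""
    name = call.get("name")
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowed: {name}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be an object")
    path = args.get("path", "")
    # Reject absolute paths and parent-directory traversal.
    if not isinstance(path, str) or path.startswith("/") or ".." in path.split("/"):
        raise ValueError(f"unsafe path: {path!r}")


def is_safe(call: Dict[str, Any]) -> bool:
    try:
        check_tool_call(call)
        return True
    except (PermissionError, ValueError):
        return False


decisions = [
    is_safe(call)
    for call in [
        {"name": "read_file", "arguments": {"path": "docs/readme.md"}},
        {"name": "read_file", "arguments": {"path": "../etc/passwd"}},
        {"name": "run_shell", "arguments": {"command": "rm -rf /"}},
    ]
]
print(decisions)
```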

### 2. API Key Management
- **Environment variables**: Secure key storage
- **Avoid logging**: treat logs as sensitive; do not log secrets (AbstractCore tries to avoid printing keys in logs)
- **Provider isolation**: Keys scoped to specific providers

### 3. Data Privacy
- **Local options**: Support for local providers (Ollama, MLX)
- **No persistent storage by default**: conversation state lives in memory (for example `BasicSession`) unless you explicitly save it or enable tracing/logging
- **Transparent processing**: All operations are observable through events

## Testing Strategy

The repo uses a mix of unit tests and integration tests. Some tests are provider-/network-/hardware-dependent and are opt-in.

Quick pointers:
- Run: `pytest -q`
- Vision tests: `tests/README_VISION_TESTING.md`
- Seed tests: `tests/README_SEED_TESTING.md`
- Streaming/tool parsing tests: `tests/streaming/` and `tests/test_agentic_cli_compatibility.py`
- Server/endpoint tests: `tests/server/` and `tests/test_abstractendpoint_singleton_provider.py`

## Integration with AbstractFramework

AbstractCore is a core package in the **AbstractFramework** ecosystem:

- AbstractFramework (umbrella): https://github.com/lpalbou/AbstractFramework
- AbstractCore (this repo): https://github.com/lpalbou/AbstractCore
- AbstractRuntime: https://github.com/lpalbou/abstractruntime

In this ecosystem, AbstractCore focuses on **LLM I/O + provider abstraction**, while AbstractRuntime focuses on **durable execution** (effects/tools/workflows/state). AbstractCore remains usable standalone; when you need durability/policy/sandboxing around tools, plug it into a runtime (for example AbstractRuntime).

```mermaid
graph TD
    subgraph "UI Layer (peers)"
        A[AbstractCode<br/>Terminal CLI]
        B[AbstractFlow Visual Editor<br/>React + ReactFlow]
    end

    A -.->|optional| F[AbstractFlow Engine]
    B --> F

    F --> C[AbstractAgent]
    A --> C
    C --> D[AbstractRuntime]
    D --> E[AbstractCore]
    E --> G[LLM Providers]

    style E fill:#e1f5fe
    style A fill:#fff3e0
    style B fill:#fff3e0
    style F fill:#f3e5f5
    style C fill:#f3e5f5
    style D fill:#f3e5f5
```

### Framework Layers
- **UI Layer** (peers):
  - AbstractCode: Terminal CLI for interactive sessions
  - AbstractFlow Visual Editor: Web-based diagram editor (React + ReactFlow + FastAPI)
- **AbstractFlow**: Multi-agent orchestration engine + visual editor
- **AbstractAgent**: Agent patterns (ReactAgent, CodeActAgent) with durable execution
- **AbstractRuntime**: Effect system, workflows, state persistence

AbstractCode can optionally use AbstractFlow for running flows. AbstractFlow includes its own visual editor for designing workflows.

## Summary

AbstractCore's architecture prioritizes:

1. **Reliability** - Production-grade error handling and retry logic
2. **Simplicity** - Clean APIs that are easy to understand and use
3. **Universality** - Same interface and features across all providers
4. **Extensibility** - Clear extension points for advanced features
5. **Observability** - Comprehensive events for monitoring and control
6. **Flexibility** - Deploy as Python library or OpenAI-compatible HTTP server

The result is a foundation that works reliably in production while remaining simple enough to learn quickly and flexible enough to build advanced applications on top of.

---

### Inlined: `docs/examples.md`

# Practical Examples

This guide shows real-world use cases for AbstractCore with complete, copy-paste examples. All examples work across any provider - just change the provider name.

## Table of Contents

- [Basic Usage](#basic-usage)
- [Glyph Visual-Text Compression](#glyph-visual-text-compression)
- [Tool Calling Examples](#tool-calling-examples)
- [Tool Call Syntax Rewriting Examples](#tool-call-syntax-rewriting-examples)
- [Structured Output Examples](#structured-output-examples)
- [Streaming Examples](#streaming-examples)
- [Session Management](#session-management)
- [Interaction Tracing (Observability)](#interaction-tracing-observability)
- [Production Patterns](#production-patterns)
- [Integration Examples](#integration-examples)

## Basic Usage

### Simple Q&A

```python
from abstractcore import create_llm

# Works with any provider
llm = create_llm("openai", model="gpt-4o-mini")  # or "anthropic", "ollama"...

response = llm.generate("What is the difference between Python and JavaScript?")
print(response.content)
```

### Multiple Providers Comparison

```python
from abstractcore import create_llm

providers = [
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-haiku-4-5"),
    ("ollama", "qwen3:4b-instruct")
]

question = "Explain Python list comprehensions with examples"

for provider_name, model in providers:
    try:
        llm = create_llm(provider_name, model=model)
        response = llm.generate(question)
        print(f"\n--- {provider_name.upper()} ---")
        print(response.content[:200] + "...")
    except Exception as e:
        print(f"{provider_name} failed: {e}")
```

### Provider Fallback

```python
from abstractcore import create_llm

def generate_with_fallback(prompt, **kwargs):
    """Try multiple providers until one works."""
    providers = [
        ("openai", "gpt-4o-mini"),
        ("anthropic", "claude-haiku-4-5"),
        ("ollama", "qwen3:4b-instruct")
    ]

    for provider_name, model in providers:
        try:
            llm = create_llm(provider_name, model=model)
            return llm.generate(prompt, **kwargs)
        except Exception as e:
            # Report the failure and fall through to the next provider
            print(f"{provider_name} failed: {e}")

    raise RuntimeError("All providers failed")

# Usage
response = generate_with_fallback("What is machine learning?")
print(response.content)
```
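
The fallback loop above can be exercised without network access. The following is a minimal, provider-agnostic sketch: `first_successful` and the two stub callables are hypothetical names (not part of AbstractCore), standing in for calls such as `create_llm(...).generate(prompt)`. It also collects every failure so the final error explains what went wrong.

```python
from typing import Callable, List, Sequence, Tuple


def first_successful(attempts: Sequence[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Return (label, result) from the first attempt that does not raise."""
    errors: List[str] = []
    for label, attempt in attempts:
        try:
            return label, attempt()
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{label}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))


# Hypothetical stubs standing in for real provider calls
def failing_provider() -> str:
    raise ConnectionError("connection refused")


def working_provider() -> str:
    return "Machine learning is ..."


label, text = first_successful([
    ("primary", failing_provider),
    ("backup", working_provider),
])
print(label, "->", text)
```

Passing callables instead of provider names keeps the retry policy separate from how each provider is constructed.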

## Glyph Visual-Text Compression

Glyph compression renders long text into images for vision-capable models to reduce effective token usage (often 3–4x on long text; depends on content/model).

Requires `pip install "abstractcore[compression]"` (and `pip install "abstractcore[media]"` if you want PDF/Office text extraction).

### Automatic Compression with Ollama

```python
from abstractcore import create_llm

# Use a vision-capable model - Glyph works automatically
llm = create_llm("ollama", model="llama3.2-vision:11b")

# Large documents are automatically compressed when beneficial
response = llm.generate(
    "What are the key findings and methodology in this research paper?",
    media=["research_paper.pdf"]  # Automatically compressed if size > threshold
)

print(f"Analysis: {response.content}")
print(f"Processing time: {response.gen_time}ms")

# Check if compression was used
if response.metadata and response.metadata.get('compression_used'):
    stats = response.metadata.get('compression_stats', {})
    print(f"✅ Glyph compression used!")
    print(f"Compression ratio: {stats.get('compression_ratio', 'N/A')}x")
    print(f"Original tokens: {stats.get('original_tokens', 'N/A')}")
    print(f"Compressed tokens: {stats.get('compressed_tokens', 'N/A')}")
```

### Explicit Compression Control

```python
from abstractcore import create_llm

# Force compression for testing
llm = create_llm("ollama", model="qwen2.5vl:7b")

# Always compress
response = llm.generate(
    "Summarize the main conclusions of this document",
    media=["long_document.pdf"],
    glyph_compression="always"  # Force compression
)

# Never compress (for comparison)
response_no_compression = llm.generate(
    "Summarize the main conclusions of this document", 
    media=["long_document.pdf"],
    glyph_compression="never"  # Disable compression
)

print(f"With compression: {response.gen_time}ms")
print(f"Without compression: {response_no_compression.gen_time}ms")
```

### Custom Configuration

```python
from abstractcore import create_llm
from abstractcore.compression import GlyphConfig

# Configure compression behavior
glyph_config = GlyphConfig(
    enabled=True,
    global_default="auto",           # "auto", "always", "never"
    quality_threshold=0.95,          # Minimum quality score (0-1)
    target_compression_ratio=3.0,    # Target compression ratio
    provider_optimization=True,      # Enable provider-specific optimization
    cache_enabled=True,             # Enable compression caching
    provider_profiles={
        "ollama": {
            "dpi": 150,              # Higher DPI for better quality
            "font_size": 9,          # Smaller font for more content
            "quality_threshold": 0.95
        }
    }
)

llm = create_llm("ollama", model="granite3.2-vision:latest", glyph_config=glyph_config)

response = llm.generate(
    "Analyze the figures and tables in this academic paper",
    media=["academic_paper.pdf"]
)
```

### Performance Benchmarking

```python
import time
from abstractcore import create_llm

def benchmark_glyph_compression(document_path, model_name="llama3.2-vision:11b"):
    """Compare processing with and without Glyph compression"""
    
    llm = create_llm("ollama", model=model_name)
    
    # Test without compression
    start = time.time()
    response_no_glyph = llm.generate(
        "Provide a detailed analysis of this document",
        media=[document_path],
        glyph_compression="never"
    )
    time_no_glyph = time.time() - start
    
    # Test with compression
    start = time.time()
    response_glyph = llm.generate(
        "Provide a detailed analysis of this document",
        media=[document_path],
        glyph_compression="always"
    )
    time_glyph = time.time() - start
    
    # Compare results
    print(f"📊 Glyph Compression Benchmark")
    print(f"Document: {document_path}")
    print(f"Model: {model_name}")
    print(f"")
    print(f"Without Glyph: {time_no_glyph:.2f}s")
    print(f"With Glyph:    {time_glyph:.2f}s")
    print(f"Speedup:       {time_no_glyph/time_glyph:.2f}x")
    print(f"")
    print(f"Response quality comparison:")
    print(f"No Glyph length:  {len(response_no_glyph.content)} chars")
    print(f"Glyph length:     {len(response_glyph.content)} chars")
    
    return response_glyph, response_no_glyph

# Run benchmark
glyph_response, normal_response = benchmark_glyph_compression("large_document.pdf")
```

### Multi-Provider Testing

```python
from abstractcore import create_llm

# Test Glyph across different providers and models
models_to_test = [
    ("ollama", "llama3.2-vision:11b"),
    ("ollama", "qwen2.5vl:7b"),
    ("ollama", "granite3.2-vision:latest"),
    # Add LMStudio if running
    # ("lmstudio", "your-vision-model"),
]

document = "research_paper.pdf"
question = "What are the key innovations presented in this paper?"

for provider, model in models_to_test:
    try:
        print(f"\n🧪 Testing {provider} - {model}")
        
        llm = create_llm(provider, model=model)
        
        response = llm.generate(
            question,
            media=[document],
            glyph_compression="auto"
        )
        
        print(f"✅ Success - {response.gen_time}ms")
        print(f"Response: {response.content[:100]}...")
        
        # Check compression usage
        if response.metadata and response.metadata.get('compression_used'):
            print(f"🎨 Glyph compression was used")
        else:
            print(f"📝 Standard processing was used")
            
    except Exception as e:
        print(f"❌ Failed: {e}")
```

**Key Benefits Demonstrated:**
- **Automatic optimization**: Glyph decides when compression is beneficial
- **Transparent integration**: Works with existing media handling code
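- **Quality preservation**: A configurable `quality_threshold` guards output quality; compare against `glyph_compression="never"` on your own documents to confirm accuracy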
- **Provider flexibility**: Works across Ollama, LMStudio, and other vision providers

[Learn more about Glyph configuration and advanced features](glyphs.md)

## Tool Calling Examples

### Weather Tool

```python
from abstractcore import create_llm

def get_weather(city: str, units: str = "metric") -> str:
    """Get current weather for a city."""
    # In production, use a real weather API
    # This is a simulated implementation
    temperatures = {
        "paris": "22°C, sunny",
        "london": "15°C, cloudy",
        "tokyo": "28°C, humid",
        "new york": "18°C, windy"
    }
    return temperatures.get(city.lower(), f"Weather data not available for {city}")

# Tool definition
weather_tool = {
    "name": "get_weather",
    "description": "Get current weather information for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "Name of the city"
            },
            "units": {
                "type": "string",
                "enum": ["metric", "imperial"],
                "description": "Temperature units"
            }
        },
        "required": ["city"]
    }
}

# Works with any provider that supports tools
llm = create_llm("openai", model="gpt-4o-mini")

response = llm.generate(
    "What's the weather like in Paris and London?",
    tools=[weather_tool]
)

print(response.content)
print(response.tool_calls)  # Structured tool call requests (host/runtime executes them)
```
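
Because tool execution is pass-through, the host has to map each returned tool call onto a Python function. The sketch below shows one way to do that, assuming each tool call is a dict with `name` and `arguments` keys (the shape shown in the tag-rewriting examples in this guide). `TOOL_REGISTRY` and `execute_tool_calls` are hypothetical helpers, and the real objects in `response.tool_calls` may differ, so adapt the field access as needed.

```python
import json
from typing import Any, Callable, Dict, List


def get_weather(city: str, units: str = "metric") -> str:
    """Simulated weather lookup (same idea as the example above)."""
    temperatures = {"paris": "22°C, sunny", "london": "15°C, cloudy"}
    return temperatures.get(city.lower(), f"Weather data not available for {city}")


# Hypothetical host-side registry: tool name -> Python callable
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def execute_tool_calls(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Run each requested tool and return one result record per call."""
    results = []
    for call in tool_calls:
        name = call.get("name", "")
        arguments = call.get("arguments") or {}
        if isinstance(arguments, str):  # arguments may arrive as a JSON string
            arguments = json.loads(arguments)
        func = TOOL_REGISTRY.get(name)
        if func is None:
            results.append({"name": name, "output": f"Error: unknown tool {name!r}"})
            continue
        try:
            output = func(**arguments)
        except Exception as exc:  # report the failure instead of crashing the host loop
            output = f"Error: {exc}"
        results.append({"name": name, "output": output})
    return results


# Assumed tool-call shape (illustrative data, not real model output)
sample_calls = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": '{"city": "London"}'},
    {"name": "unknown_tool", "arguments": {}},
]
for result in execute_tool_calls(sample_calls):
    print(result)
```

Returning error strings rather than raising lets the host feed failures back to the model as ordinary tool results.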

### Calculator Tool

```python
from abstractcore import create_llm
import math

def calculate(expression: str) -> str:
    """Safely evaluate mathematical expressions."""
    try:
        # In production, use a proper expression parser
        # This is simplified for demo purposes
        allowed_chars = set('0123456789+-*/.() ')
        if not all(c in allowed_chars for c in expression):
            return "Error: Invalid characters in expression"

        # Expose no builtins or names to the evaluated expression
        result = eval(expression, {"__builtins__": {}}, {})
        return f"{expression} = {result}"
    except Exception as e:
        return f"Error calculating {expression}: {str(e)}"

def sqrt(number: float) -> str:
    """Calculate square root."""
    try:
        result = math.sqrt(number)
        return f"√{number} = {result}"
    except Exception as e:
        return f"Error: {str(e)}"

# Tool definitions
tools = [
    {
        "name": "calculate",
        "description": "Perform basic mathematical calculations",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Mathematical expression"}
            },
            "required": ["expression"]
        }
    },
    {
        "name": "sqrt",
        "description": "Calculate square root of a number",
        "parameters": {
            "type": "object",
            "properties": {
                "number": {"type": "number", "description": "Number to calculate square root of"}
            },
            "required": ["number"]
        }
    }
]

llm = create_llm("openai", model="gpt-4o-mini")

response = llm.generate(
    "What is 25 * 4 + 12, and what's the square root of 144?",
    tools=tools
)

print(response.content)
print(response.tool_calls)  # Structured tool call requests (host/runtime executes them)
```
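
The calculator example notes that production code should use a proper expression parser instead of `eval`. One possible approach, sketched here with the standard-library `ast` module, is to parse the expression and evaluate only whitelisted node types. `safe_calculate` is a hypothetical helper name, and the supported operator set (`+`, `-`, `*`, `/`) is an assumption chosen to match the demo.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> float:
    """Evaluate +, -, *, / arithmetic on numeric literals without calling eval()."""

    def evaluate(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        # type() check (not isinstance) so that booleans are rejected
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    return evaluate(ast.parse(expression.strip(), mode="eval"))


print(safe_calculate("25 * 4 + 12"))  # 112
```

Function calls, names, and exponentiation all raise `ValueError`, so inputs such as `__import__('os')` or `9 ** 9 ** 9` are rejected before anything runs.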

### File Operations Tool

```python
from abstractcore import create_llm
from pathlib import Path

def list_files(directory: str = ".") -> str:
    """List files in a directory."""
    try:
        path = Path(directory)
        if not path.exists():
            return f"Directory {directory} does not exist"

        files = []
        for item in path.iterdir():
            if item.is_file():
                files.append(f"FILE: {item.name}")
            elif item.is_dir():
                files.append(f"DIR: {item.name}/")

        return f"Contents of {directory}:\n" + "\n".join(sorted(files))
    except Exception as e:
        return f"Error listing files: {str(e)}"

def read_file(filename: str) -> str:
    """Read contents of a text file."""
    try:
        path = Path(filename)
        if not path.exists():
            return f"File {filename} does not exist"

        content = path.read_text(encoding='utf-8')
        return f"Contents of {filename}:\n{content}"
    except Exception as e:
        return f"Error reading file: {str(e)}"

# Tool definitions
file_tools = [
    {
        "name": "list_files",
        "description": "List files and directories in a given path",
        "parameters": {
            "type": "object",
            "properties": {
                "directory": {"type": "string", "description": "Directory path to list"}
            }
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a text file",
        "parameters": {
            "type": "object",
            "properties": {
                "filename": {"type": "string", "description": "Path to the file to read"}
            },
            "required": ["filename"]
        }
    }
]

llm = create_llm("anthropic", model="claude-haiku-4-5")

response = llm.generate(
    "List the files in the current directory and read the README.md file if it exists",
    tools=file_tools
)

print(response.content)
print(response.tool_calls)  # Structured tool call requests (host/runtime executes them)
```

## Tool Call Syntax Rewriting Examples

> **Real-time tool call format conversion for agentic CLI compatibility**

Tool call syntax rewriting enables AbstractCore to work seamlessly with any agentic CLI by converting tool calls to the expected format in real-time. This happens automatically during generation, including streaming.

> **Related**: [Tool Call Syntax Rewriting Guide](tool-syntax-rewriting.md)

### Codex CLI Integration (Qwen3 Tags)

```python
from abstractcore import create_llm

# Define tools (standard JSON format)
weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
    }
}

# Codex CLI expects qwen3-style tool-call tags in assistant content.
# By default, AbstractCore strips tool-call markup from `response.content`;
# pass `tool_call_tags` to preserve/emit the tags for downstream parsers.
llm = create_llm("ollama", model="qwen3:4b-instruct")
response = llm.generate("What's the weather in Tokyo?", tools=[weather_tool], tool_call_tags="qwen3")

print(response.content)
print(response.tool_calls)
# Content includes: <|tool_call|>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</|tool_call|>
```

### Crush CLI Integration

```python
# Crush CLI expects LLaMA3 format - just specify the format
llm = create_llm("ollama", model="qwen3:4b-instruct")
response = llm.generate("Get weather for London", tools=[weather_tool], tool_call_tags="llama3")

print(response.content)
# Output includes: <function_call>{"name": "get_weather", "arguments": {"city": "London"}}</function_call>
```

### Custom CLI Format

```python
# Your custom CLI expects: [TOOL]...JSON...[/TOOL]
llm = create_llm("ollama", model="qwen3:4b-instruct")
response = llm.generate("Check weather in Paris", tools=[weather_tool], tool_call_tags="[TOOL],[/TOOL]")

print(response.content)
# Output includes: [TOOL]{"name": "get_weather", "arguments": {"city": "Paris"}}[/TOOL]
```
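
On the consuming side, a downstream CLI has to pull the JSON payload back out of the tagged content. The sketch below shows one way to do that for the custom `[TOOL]...[/TOOL]` format; `extract_tagged_tool_calls` is a hypothetical helper, and the sample content is illustrative rather than real model output.

```python
import json
import re
from typing import Any, Dict, List


def extract_tagged_tool_calls(content: str, start_tag: str, end_tag: str) -> List[Dict[str, Any]]:
    """Return every JSON payload found between start_tag and end_tag."""
    # re.escape keeps characters such as "[" and "|" literal
    pattern = re.compile(re.escape(start_tag) + r"(.*?)" + re.escape(end_tag), re.DOTALL)
    calls = []
    for payload in pattern.findall(content):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than failing the whole parse
    return calls


# Illustrative content in the custom format from the example above
content = (
    "Let me check. "
    '[TOOL]{"name": "get_weather", "arguments": {"city": "Paris"}}[/TOOL]'
)
print(extract_tagged_tool_calls(content, "[TOOL]", "[/TOOL]"))
```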

### Real-Time Streaming with Tag Rewriting

```python
# Streaming works seamlessly with any format
calculator_tool = {
    "name": "calculate",
    "description": "Perform mathematical calculations",
    "parameters": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"]
    }
}

llm = create_llm("ollama", model="qwen3-coder:30b")

print("AI: ", end="", flush=True)
for chunk in llm.generate(
    "Calculate 15 * 23 and explain the result",
    tools=[calculator_tool],
    stream=True,
    tool_call_tags="llama3",
):
    print(chunk.content or "", end="", flush=True)  # content may be empty on tool-call chunks

    # Tool calls are surfaced in real-time (execution is host/runtime-owned)
    if chunk.tool_calls:
        for tool_call in chunk.tool_calls:
            print(f"\n[TOOL CALL] {tool_call}")

print("\n")
# Shows: <function_call>{"name": "calculate", "arguments": {"expression": "15 * 23"}}</function_call>
# Tool execution is owned by the host/runtime.
```

### Multiple Tools with Different Formats

```python
# Define multiple tools
tools = [
    {
        "name": "get_weather",
        "description": "Get weather information",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    },
    {
        "name": "calculate",
        "description": "Perform calculations",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"]
        }
    },
    {
        "name": "list_files",
        "description": "List files in a directory",
        "parameters": {
            "type": "object",
            "properties": {"directory": {"type": "string"}},
            "required": ["directory"]
        }
    }
]

# Test with XML format for Gemini CLI
llm = create_llm("ollama", model="qwen3:4b-instruct")
response = llm.generate(
    "What's 2+2, weather in NYC, and files in current directory?",
    tools=tools,
    tool_call_tags="xml",
)

print(response.content)
print(response.tool_calls)
# All tool calls converted to: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
```

### Session-Based Format Configuration

```python
from abstractcore import BasicSession, create_llm

# Apply a consistent tool-call tag format across a session by reusing a variable
tool_call_tags = "llama3"

llm = create_llm("ollama", model="qwen3:4b-instruct")
session = BasicSession(provider=llm)

session.generate("Calculate 10 * 5", tools=[calculator_tool], tool_call_tags=tool_call_tags)
session.generate("What's the weather like?", tools=[weather_tool], tool_call_tags=tool_call_tags)
session.generate("List files in documents", tools=[{
    "name": "list_files",
    "description": "List directory contents",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"]
    }
}], tool_call_tags=tool_call_tags)

# All responses contain: <function_call>...JSON...</function_call>
```

### Production Monitoring with Events

```python
from abstractcore import create_llm
from abstractcore.events import EventType, on_global

# Monitor tool usage across different formats
def log_tool_calls(event):
    # Tool execution events are emitted when tools are executed (e.g., via ToolRegistry
    # or when using `execute_tools=True` (deprecated)).
    print(f"[TOOL EVENT] {event.type}: {event.data}")

on_global(EventType.TOOL_COMPLETED, log_tool_calls)

# Test with different formats
for format_name in ["qwen3", "llama3", "xml"]:
    llm = create_llm("ollama", model="qwen3:4b-instruct")
    response = llm.generate("Calculate 5 * 5", tools=[calculator_tool], tool_call_tags=format_name)
    print(f"{format_name} format result: {response.content[:100]}...")
```

**Key Benefits**:
- Per-call configuration: pass `tool_call_tags=...` when you need tool-call markup preserved/rewritten in `response.content`
- Real-time processing: No post-processing delays
- Streaming compatible: Works with streaming mode
- Format flexibility: Predefined formats plus custom tags

> **Related**: [Tool Call Syntax Rewriting Guide](tool-syntax-rewriting.md) | [Unified Streaming Architecture](architecture.md#unified-streaming-architecture)

## Structured Output Examples

> **Complete Guide**: [Structured Output Documentation](structured-output.md) - Native vs prompted strategies, provider support, schema design best practices

### User Profile Extraction

```python
from abstractcore import create_llm
from pydantic import BaseModel, field_validator
from typing import Optional

class UserProfile(BaseModel):
    name: str
    age: int
    email: str
    occupation: Optional[str] = None
    interests: list[str] = []

    @field_validator('age')
    @classmethod
    def validate_age(cls, v):
        if v < 0 or v > 150:
            raise ValueError('Age must be between 0 and 150')
        return v

    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email format')
        return v

llm = create_llm("openai", model="gpt-4o-mini")

# Text with user information
user_text = """
Hi, I'm Sarah Johnson, I'm 28 years old and work as a software engineer.
My email is sarah.johnson@techcorp.com. I love hiking, photography, and cooking.
"""

# Extract structured data with automatic validation
user = llm.generate(
    f"Extract user profile from: {user_text}",
    response_model=UserProfile
)

print(f"Name: {user.name}")
print(f"Age: {user.age}")
print(f"Email: {user.email}")
print(f"Occupation: {user.occupation}")
print(f"Interests: {', '.join(user.interests)}")
```

### Product Catalog Extraction

```python
from abstractcore import create_llm
from pydantic import BaseModel, field_validator
from typing import List
from enum import Enum

class ProductCategory(str, Enum):
    ELECTRONICS = "electronics"
    CLOTHING = "clothing"
    BOOKS = "books"
    HOME = "home"
    SPORTS = "sports"

class Product(BaseModel):
    name: str
    price: float
    category: ProductCategory
    description: str
    in_stock: bool = True

    @field_validator('price')
    @classmethod
    def validate_price(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

class ProductCatalog(BaseModel):
    products: List[Product]
    total_count: int

    @field_validator('total_count')
    @classmethod
    def validate_count(cls, v, info):
        products = info.data.get('products', [])
        if v != len(products):
            raise ValueError(f'Total count {v} does not match products length {len(products)}')
        return v

llm = create_llm("anthropic", model="claude-haiku-4-5")

catalog_text = """
Our store has these items:
1. Gaming Laptop - $1299.99 - High-performance laptop for gaming and work
2. Wireless Headphones - $199.99 - Noise-cancelling bluetooth headphones
3. Python Programming Book - $49.99 - Complete guide to Python programming
4. Coffee Maker - $89.99 - Automatic drip coffee maker, currently out of stock
"""

catalog = llm.generate(
    f"Extract product catalog from: {catalog_text}",
    response_model=ProductCatalog
)

print(f"Total products: {catalog.total_count}")
for product in catalog.products:
    status = "In Stock" if product.in_stock else "Out of Stock"
    # Use .value so the plain string is printed regardless of Python version
    print(f"- {product.name}: ${product.price} ({product.category.value}) - {status}")
```

### Code Review Analysis

```python
from abstractcore import create_llm
from pydantic import BaseModel
from typing import List
from enum import Enum

class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class CodeIssue(BaseModel):
    line_number: int
    severity: Severity
    issue_type: str
    description: str
    suggestion: str

class CodeReview(BaseModel):
    language: str
    overall_quality: str
    issues: List[CodeIssue]
    recommendations: List[str]

llm = create_llm("ollama", model="qwen3:4b-instruct")

code_to_review = '''
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

def process_data(data):
    if data == None:
        return []
    result = []
    for item in data:
        result.append(item * 2)
    return result
'''

review = llm.generate(
    f"Review this Python code for issues:\n{code_to_review}",
    response_model=CodeReview
)

print(f"Language: {review.language}")
print(f"Overall Quality: {review.overall_quality}")
print(f"\nIssues Found ({len(review.issues)}):")
for issue in review.issues:
    print(f"  Line {issue.line_number}: [{issue.severity.upper()}] {issue.issue_type}")
    print(f"    Problem: {issue.description}")
    print(f"    Fix: {issue.suggestion}\n")

print("Recommendations:")
for rec in review.recommendations:
    print(f"  - {rec}")
```

## Streaming Examples

### Basic Streaming (Unified 2025)

```python
# Streaming uses a unified processor across providers (exact chunking depends on the backend)
from abstractcore import create_llm

llm = create_llm("anthropic", model="claude-haiku-4-5")

print("AI Story Generator: ", end="", flush=True)
for chunk in llm.generate(
    "Write a short story about a programmer who discovers their code is alive",
    stream=True
):
    print(chunk.content or "", end="", flush=True)
print("\n")
```

### Advanced Streaming with Progress and Performance Tracking

```python
from abstractcore import create_llm
import time

def streaming_with_insights(prompt):
    # Supports any provider: OpenAI, Anthropic, Ollama, MLX
    llm = create_llm("openai", model="gpt-4o-mini")

    print("Generating response...")

    start_time = time.time()
    chunks = []

    print("Response: ", end="", flush=True)
    for chunk in llm.generate(prompt, stream=True):
        chunks.append(chunk)
        print(chunk.content or "", end="", flush=True)

        # Optional real-time performance insights
        if len(chunks) % 10 == 0:
            current_time = time.time() - start_time
            chars_generated = sum(len(c.content or "") for c in chunks)
            print(f"\n[PROGRESS] {len(chunks)} chunks, {chars_generated} chars, {current_time:.1f}s")

    # Final performance summary
    total_time = time.time() - start_time
    total_chars = sum(len(chunk.content or "") for chunk in chunks)

    print(f"\n\n[STATS] Streaming Performance:")
    print(f"- Total Chunks: {len(chunks)}")
    print(f"- Total Characters: {total_chars}")
    print(f"- Duration: {total_time:.2f}s")
    print(f"- Speed: {total_chars/total_time:.0f} chars/sec")

# Usage with various prompts
streaming_with_insights("Explain quantum computing in simple terms")
```

### Real-Time Streaming with Tools (Unified Implementation)

```python
from abstractcore import create_llm
from datetime import datetime

def get_current_time() -> str:
    """Get the current time."""
    return datetime.now().strftime("%H:%M:%S")

def get_weather(city: str) -> str:
    """Get current weather for a city."""
    weather_data = {
        "New York": "Sunny, 22°C",
        "London": "Cloudy, 15°C",
        "Tokyo": "Partly cloudy, 25°C"
    }
    return weather_data.get(city, f"Weather data unavailable for {city}")

time_tool = {
    "name": "get_current_time",
    "description": "Get the current time",
    "parameters": {"type": "object", "properties": {}}
}

weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "Name of the city"}
        },
        "required": ["city"]  # get_weather() has no default for city
    }
}

# Works similarly across providers (exact chunking depends on the backend)
llm = create_llm("ollama", model="qwen3:4b-instruct")

print("AI Assistant: ", end="", flush=True)
for chunk in llm.generate(
    "What time is it right now? And can you tell me the weather in New York?",
    tools=[time_tool, weather_tool],
    stream=True
):
    # Real-time chunk processing and tool call detection
    print(chunk.content or "", end="", flush=True)

    # Tool calls are surfaced as structured dicts; execute them in your host/runtime.
    if chunk.tool_calls:
        print(f"\n[TOOL] Tool calls: {chunk.tool_calls}")

print("\n")  # Newline after streaming

# Notes:
# - Real-time tool call detection
# - Streams chunks as they arrive (minimal buffering)
# - Works with OpenAI, Anthropic, Ollama, MLX (provider-dependent details)
```
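
Since the host executes tools, a common pattern is to collect the streamed text and tool calls first and act on them once the stream ends. The following sketch assumes chunks expose `content` and `tool_calls` attributes as in the example above; `collect_stream` is a hypothetical helper and the `SimpleNamespace` chunks are stand-ins for a real stream.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def collect_stream(chunks: Iterable[Any]) -> Tuple[str, List[Any]]:
    """Join streamed text and gather tool calls in arrival order."""
    text_parts: List[str] = []
    tool_calls: List[Any] = []
    for chunk in chunks:
        text_parts.append(getattr(chunk, "content", None) or "")
        tool_calls.extend(getattr(chunk, "tool_calls", None) or [])
    return "".join(text_parts), tool_calls


# Stand-in chunks (illustrative); a real stream would come from llm.generate(..., stream=True)
fake_stream = [
    SimpleNamespace(content="Checking the time", tool_calls=None),
    SimpleNamespace(content=None, tool_calls=[{"name": "get_current_time", "arguments": {}}]),
    SimpleNamespace(
        content=" and the weather.",
        tool_calls=[{"name": "get_weather", "arguments": {"city": "New York"}}],
    ),
]
text, calls = collect_stream(fake_stream)
print(text)
print([call["name"] for call in calls])
```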

### Performance-Optimized Streaming

```python
from abstractcore import create_llm
import time

def compare_providers(prompt):
    """Compare streaming performance across providers."""
    providers = [
        ("openai", "gpt-4o-mini"),
        ("anthropic", "claude-haiku-4-5"),
        ("ollama", "qwen3:4b-instruct")
    ]

    for provider, model in providers:
        try:
            llm = create_llm(provider, model=model)

            print(f"\n[TEST] {provider.upper()} - {model}")
            start_time = time.time()

            chunks = []
            for chunk in llm.generate(prompt, stream=True):
                chunks.append(chunk)
                print(chunk.content or "", end="", flush=True)

            total_time = time.time() - start_time
            total_chars = sum(len(chunk.content or "") for chunk in chunks)

            print(f"\n\n[PERF] {provider.upper()} Performance:")
            print(f"- Chunks: {len(chunks)}")
            print(f"- Characters: {total_chars}")
            print(f"- Duration: {total_time:.2f}s")
            print(f"- Speed: {total_chars/total_time:.0f} chars/sec")

        except Exception as e:
            print(f"[ERROR] {provider} failed: {e}")

# Compare streaming performance
compare_providers("Write a creative short story about artificial intelligence")
```

**Streaming Features**:
- Time-to-first-token depends on provider/model/network
- Unified strategy across all providers
- Real-time tool call detection
- Streams chunks as they arrive (minimal buffering)
- Supports: OpenAI, Anthropic, Ollama, MLX, LMStudio, HuggingFace
- Robust error handling for malformed responses
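
Because time-to-first-token varies by provider, model, and network, it is worth measuring in your own environment. A minimal sketch follows: `measure_stream` is a hypothetical helper that works on any iterable of chunks with a `content` attribute, and `fake_stream` is an illustrative generator that imitates a slow backend.

```python
import time
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Iterator, Optional


def measure_stream(chunks: Iterable[Any]) -> Dict[str, Any]:
    """Consume a stream and report time-to-first-token plus totals."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    total_chars = 0
    chunk_count = 0
    for chunk in chunks:
        text = getattr(chunk, "content", None) or ""
        if text and first_token_at is None:
            first_token_at = time.perf_counter() - start  # first non-empty chunk
        total_chars += len(text)
        chunk_count += 1
    return {
        "time_to_first_token_s": first_token_at,
        "total_time_s": time.perf_counter() - start,
        "chunks": chunk_count,
        "chars": total_chars,
    }


def fake_stream() -> Iterator[SimpleNamespace]:
    """Illustrative generator that imitates a slow backend."""
    for piece in ["Hello", ", ", "world"]:
        time.sleep(0.01)
        yield SimpleNamespace(content=piece)


stats = measure_stream(fake_stream())
print(stats)
```

With a real provider, pass `llm.generate(prompt, stream=True)` in place of `fake_stream()`.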

## Session Management

### Basic Conversation

```python
from abstractcore import create_llm, BasicSession

llm = create_llm("openai", model="gpt-4o-mini")
session = BasicSession(
    provider=llm,
    system_prompt="You are a helpful coding tutor. Always provide examples."
)

# Multi-turn conversation
print("=== Conversation Start ===")

response1 = session.generate("Hi, I'm learning Python. What are decorators?")
print("User: Hi, I'm learning Python. What are decorators?")
print(f"AI: {response1.content}\n")

response2 = session.generate("Can you show me a practical example?")
print("User: Can you show me a practical example?")
print(f"AI: {response2.content}\n")

response3 = session.generate("What was my first question?")
print("User: What was my first question?")
print(f"AI: {response3.content}\n")

print(f"Total messages in conversation: {len(session.messages)}")
```

### Session Persistence

```python
from abstractcore import create_llm, BasicSession
from pathlib import Path

# Create and use session
llm = create_llm("anthropic", model="claude-haiku-4-5")
session = BasicSession(
    provider=llm,
    system_prompt="You are a travel advisor. Help plan trips."
)

# Have a conversation
session.generate("I want to plan a trip to Japan")
session.generate("I'm interested in both modern cities and traditional culture")
session.generate("My budget is around $3000 for 10 days")

# Save session
session_file = Path("travel_planning_session.json")
session.save(session_file)
print(f"Session saved to {session_file}")

# Later: Load session and continue
new_session = BasicSession.load(session_file, provider=llm)
response = new_session.generate("What were we discussing?")
print(f"AI remembers: {response.content}")

# Clean up
session_file.unlink()  # Delete the file
```

### Context Management

```python
from abstractcore import create_llm, BasicSession

def create_coding_assistant():
    """Create a specialized coding assistant session."""
    llm = create_llm("ollama", model="qwen3:4b-instruct")

    system_prompt = """
    You are an expert Python coding assistant. For each request:
    1. Provide working code examples
    2. Explain the code clearly
    3. Mention potential issues or improvements
    4. Keep responses concise but complete
    """

    return BasicSession(provider=llm, system_prompt=system_prompt)

# Usage
assistant = create_coding_assistant()

# The assistant will remember the context throughout the conversation
assistant.generate("I need a function to validate email addresses")
assistant.generate("Now add logging to that function")
assistant.generate("How would I test this function?")

print(f"Conversation history: {len(assistant.messages)} messages")

# Clear history but keep system prompt
assistant.clear_history()
print(f"After clearing: {len(assistant.messages)} messages")  # Just system prompt remains
```

## Interaction Tracing (Observability)

### Basic Tracing

Enable tracing to capture complete LLM interaction history for debugging and transparency:

```python
from abstractcore import create_llm

# Enable tracing on provider
llm = create_llm(
    'openai',
    model='gpt-4o-mini',
    enable_tracing=True,
    max_traces=100  # Keep last 100 interactions (ring buffer)
)

# Generate with custom metadata
response = llm.generate(
    "Explain quantum computing",
    temperature=0.7,
    trace_metadata={
        'user_id': 'user_123',
        'session_type': 'educational',
        'topic': 'quantum_physics'
    }
)

# Access trace by ID
trace_id = response.metadata['trace_id']
trace = llm.get_traces(trace_id=trace_id)

print(f"Trace ID: {trace['trace_id']}")
print(f"Timestamp: {trace['timestamp']}")
print(f"Prompt: {trace['prompt']}")
print(f"Response: {trace['response']['content'][:100]}...")
print(f"Tokens: {trace['response']['usage']['total_tokens']}")
print(f"Time: {trace['response']['generation_time_ms']:.2f}ms")
print(f"Custom metadata: {trace['metadata']}")
```
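
The `max_traces` comment above describes the trace store as a ring buffer. As a rough illustration of that retention behavior, here is a sketch built on the standard library's `collections.deque`. It is not AbstractCore's implementation, and `TraceBuffer` is a hypothetical name used only for this example.

```python
from collections import deque
from typing import Any, Dict, List, Optional


class TraceBuffer:
    """Hypothetical bounded trace store (illustration only, not AbstractCore's class)."""

    def __init__(self, max_traces: int = 100) -> None:
        # deque(maxlen=N) drops the oldest item once more than N are appended.
        self._traces: deque = deque(maxlen=max_traces)

    def add(self, trace: Dict[str, Any]) -> None:
        self._traces.append(trace)

    def get(self, last_n: Optional[int] = None) -> List[Dict[str, Any]]:
        traces = list(self._traces)
        if last_n is None:
            return traces
        return traces[-last_n:] if last_n > 0 else []


buffer = TraceBuffer(max_traces=3)
for i in range(5):
    buffer.add({"trace_id": f"t{i}", "prompt": f"Test {i}"})

# Only the three most recent traces survive.
kept_ids = [t["trace_id"] for t in buffer.get()]
print(kept_ids)
```

With a small `max_traces`, older trace IDs stop resolving, so export traces you need to keep before they are evicted.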

### Session-Level Tracing

Automatically trace every interaction in a session; each trace carries the session ID so related calls can be correlated:

```python
from abstractcore import create_llm
from abstractcore.core.session import BasicSession

llm = create_llm('openai', model='gpt-4o-mini', enable_tracing=True)
session = BasicSession(provider=llm, enable_tracing=True)

# All interactions automatically traced
session.generate("What is Python?")
session.generate("Give me an example")
session.generate("Explain list comprehensions")

# Get all session traces
traces = session.get_interaction_history()

print(f"\nSession ID: {session.id}")
print(f"Total interactions: {len(traces)}")

for i, trace in enumerate(traces, 1):
    print(f"\nInteraction {i}:")
    print(f"  Prompt: {trace['prompt']}")
    print(f"  Tokens: {trace['response']['usage']['total_tokens']}")
    print(f"  Time: {trace['response']['generation_time_ms']:.0f}ms")
    print(f"  Session ID: {trace['metadata']['session_id']}")
```

### Multi-Step Workflow with Retries

Track code generation workflows with retry attempts:

```python
from abstractcore import create_llm
from abstractcore.core.session import BasicSession

llm = create_llm('openai', model='gpt-4o-mini', enable_tracing=True)
session = BasicSession(provider=llm, enable_tracing=True)

# Step 1: Generate code
response = session.generate(
    "Write a Python function to calculate fibonacci numbers",
    system_prompt="You are a Python code generator. Only output code.",
    step_type='code_generation',
    attempt_number=1,
    temperature=0
)

code = response.content
success = False

# Steps 2-4: Execute with retry logic
# NOTE: exec() runs model-generated code in this process; sandbox it in real applications.
for attempt in range(1, 4):
    try:
        exec(code)  # Simulate execution
        success = True
        break
    except Exception as e:
        # Retry with error context
        response = session.generate(
            f"Previous code failed: {e}. Fix it.",
            step_type='code_generation',
            attempt_number=attempt + 1,
            temperature=0
        )
        code = response.content

# Get workflow summary
traces = session.get_interaction_history()

print(f"\nWorkflow Summary:")
print(f"Total attempts: {len(traces)}")
print(f"Final status: {'Success' if success else 'Failed'}")

for trace in traces:
    step = trace['metadata']['step_type']
    attempt = trace['metadata']['attempt_number']
    tokens = trace['response']['usage']['total_tokens']
    print(f"  {step} (Attempt {attempt}): {tokens} tokens")
```

### Export Traces

Export traces to different formats for analysis:

```python
from abstractcore import create_llm
from abstractcore.utils import export_traces, summarize_traces

llm = create_llm('openai', model='gpt-4o-mini', enable_tracing=True)

# Generate some interactions
for i in range(5):
    llm.generate(f"Question {i+1}", temperature=0)

traces = llm.get_traces()

# Export to JSONL (one JSON per line)
export_traces(traces, format='jsonl', file_path='traces.jsonl')

# Export to pretty JSON
export_traces(traces, format='json', file_path='traces.json')

# Export to Markdown report
export_traces(traces, format='markdown', file_path='trace_report.md')

# Get summary statistics
summary = summarize_traces(traces)
print(f"\nSummary:")
print(f"  Total interactions: {summary['total_interactions']}")
print(f"  Total tokens: {summary['total_tokens']}")
print(f"  Average tokens: {summary['avg_tokens_per_interaction']:.0f}")
print(f"  Total time: {summary['total_time_ms']:.2f}ms")
print(f"  Average time: {summary['avg_time_ms']:.2f}ms")
print(f"  Providers: {summary['providers']}")
print(f"  Models: {summary['models']}")
```
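
The JSONL export is described above as one JSON object per line, so it can be read back with the standard library alone. The sketch below assumes each line is a standalone JSON object with the `response.usage.total_tokens` field shown in the earlier trace examples. The sample records are made up for illustration, and `load_jsonl` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Illustrative records only; real traces carry more fields.
sample = [
    {"trace_id": "a", "response": {"usage": {"total_tokens": 12}}},
    {"trace_id": "b", "response": {"usage": {"total_tokens": 30}}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "traces.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8")
    loaded = load_jsonl(path)

total_tokens = sum(r["response"]["usage"]["total_tokens"] for r in loaded)
print(len(loaded), total_tokens)
```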

### Retrieve Specific Traces

Different ways to retrieve traces:

```python
from abstractcore import create_llm

llm = create_llm('openai', model='gpt-4o-mini', enable_tracing=True)

# Generate some interactions
for i in range(10):
    llm.generate(f"Test {i}", temperature=0)

# Get all traces
all_traces = llm.get_traces()
print(f"Total traces: {len(all_traces)}")

# Get last 5 traces
recent = llm.get_traces(last_n=5)
print(f"Last 5 prompts: {[t['prompt'] for t in recent]}")

# Get specific trace by ID
response = llm.generate("Specific query", temperature=0)
trace_id = response.metadata['trace_id']
trace = llm.get_traces(trace_id=trace_id)
print(f"Specific trace: {trace['prompt']}")
```

[Learn more about Interaction Tracing](interaction-tracing.md)

## Production Patterns

### Retry and Error Handling

```python
from abstractcore import create_llm
from abstractcore.core.retry import RetryConfig
from abstractcore.exceptions import ProviderAPIError, RateLimitError
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def create_production_llm():
    """Create LLM with production-grade retry configuration."""
    retry_config = RetryConfig(
        max_attempts=3,
        initial_delay=1.0,
        max_delay=30.0,
        use_jitter=True,
        failure_threshold=5
    )

    return create_llm(
        "openai",
        model="gpt-4o-mini",
        retry_config=retry_config,
        timeout=30
    )

def safe_generate(prompt, **kwargs):
    """Generate with comprehensive error handling."""
    llm = create_production_llm()

    try:
        logger.info(f"Generating response for prompt: {prompt[:50]}...")
        response = llm.generate(prompt, **kwargs)
        logger.info(f"Response generated successfully: {len(response.content)} chars")
        return response

    except RateLimitError as e:
        logger.warning(f"Rate limited: {e}")
        raise

    except ProviderAPIError as e:
        logger.error(f"API error: {e}")
        raise

    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        raise

# Usage
try:
    response = safe_generate("What is machine learning?")
    print(response.content)
except Exception as e:
    print(f"Generation failed: {e}")
```
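
The `RetryConfig` fields above (`initial_delay`, `max_delay`, `use_jitter`) suggest exponential backoff. The sketch below shows one common way such parameters translate into wait times. The doubling factor and the "full jitter" formula are assumptions for illustration, not necessarily what AbstractCore does internally, and `backoff_delay` is a hypothetical helper.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    use_jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before retry number `attempt` (1-based), assuming the delay doubles each time."""
    capped = min(max_delay, initial_delay * (2 ** (attempt - 1)))
    if not use_jitter:
        return capped
    # "Full jitter": pick uniformly in [0, capped] so clients do not retry in lockstep.
    rng = rng or random.Random()
    return rng.uniform(0.0, capped)


# Without jitter the schedule is deterministic and capped at max_delay.
delays = [backoff_delay(n, use_jitter=False) for n in range(1, 8)]
print(delays)
```

Jitter matters most when many workers hit the same rate limit at once: randomizing the wait spreads their retries out instead of synchronizing them.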

### Cost Monitoring

```python
from abstractcore import create_llm
from abstractcore.events import EventType, on_global
from datetime import datetime
import json

class CostMonitor:
    def __init__(self, budget_limit=10.0):
        self.total_cost = 0.0
        self.budget_limit = budget_limit
        self.requests = []

        # Register event handlers
        on_global(EventType.GENERATION_COMPLETED, self.track_cost)

    def track_cost(self, event):
        """Track costs from generation events."""
        cost = event.data.get("cost_usd")
        if cost:
            # NOTE: `cost_usd` is a best-effort estimate based on token usage.
            cost_f = float(cost)
            self.total_cost += cost_f
            self.requests.append({
                'timestamp': event.timestamp.isoformat(),
                'provider': event.data.get('provider'),
                'model': event.data.get('model'),
                'cost_usd': cost_f,
                'tokens_input': event.data.get('tokens_input'),
                'tokens_output': event.data.get('tokens_output')
            })

            print(f"[COST] ${cost_f:.4f} | Total: ${self.total_cost:.4f}")

            if self.total_cost > self.budget_limit:
                print(f"[WARN] BUDGET EXCEEDED: ${self.total_cost:.4f} > ${self.budget_limit}")

    def get_report(self):
        """Get cost report."""
        return {
            'total_cost': self.total_cost,
            'budget_limit': self.budget_limit,
            'total_requests': len(self.requests),
            'average_cost': self.total_cost / len(self.requests) if self.requests else 0,
            'requests': self.requests
        }

# Usage
monitor = CostMonitor(budget_limit=1.0)  # $1 budget

llm = create_llm("openai", model="gpt-4o-mini")

# Make some requests
for i in range(3):
    response = llm.generate(f"Tell me a fact about number {i+1}")
    print(f"Fact {i+1}: {response.content[:100]}...\n")

# Get report
report = monitor.get_report()
print(f"\n[REPORT] Final Cost Summary:")
print(f"Total cost: ${report['total_cost']:.4f}")
print(f"Requests: {report['total_requests']}")
print(f"Average per request: ${report['average_cost']:.4f}")
```

### Load Balancing

```python
from abstractcore import create_llm
import random
import time
from typing import List, Tuple

class LoadBalancer:
    def __init__(self, providers: List[Tuple[str, str]]):
        """Initialize with list of (provider, model) tuples."""
        self.providers = []
        self.weights = []

        for provider_name, model in providers:
            try:
                llm = create_llm(provider_name, model=model)
                self.providers.append((llm, provider_name, model))
                self.weights.append(1.0)  # Equal weight initially
                print(f"[OK] {provider_name} ({model}) ready")
            except Exception as e:
                print(f"[FAIL] {provider_name} ({model}) failed: {e}")

    def generate(self, prompt, **kwargs):
        """Generate using weighted random selection."""
        if not self.providers:
            raise RuntimeError("No providers available")

        # Weighted random selection
        provider_data = random.choices(
            self.providers,
            weights=self.weights,
            k=1
        )[0]

        llm, provider_name, model = provider_data

        try:
            start_time = time.time()
            response = llm.generate(prompt, **kwargs)
            duration = time.time() - start_time

            print(f"[OK] {provider_name} responded in {duration:.2f}s")
            return response

        except Exception as e:
            print(f"[FAIL] {provider_name} failed: {e}")
            # Penalize the failed provider so it is selected less often
            idx = self.providers.index(provider_data)
            self.weights[idx] *= 0.1  # Reduce weight dramatically
            raise

# Usage
balancer = LoadBalancer([
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-haiku-4-5"),
    ("ollama", "qwen3:4b-instruct")
])

# Make requests - they'll be distributed across available providers
for i in range(5):
    try:
        response = balancer.generate(f"Tell me about topic number {i+1}")
        print(f"Response {i+1}: {response.content[:50]}...\n")
    except Exception as e:
        print(f"Request {i+1} failed: {e}\n")
```

## Integration Examples

### FastAPI Integration

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from abstractcore import create_llm, BasicSession
from typing import Optional
import uuid

app = FastAPI(title="AbstractCore API")

# Global LLM instance
llm = create_llm("openai", model="gpt-4o-mini")

# Store sessions in memory (use Redis in production)
sessions = {}

class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = None
    system_prompt: Optional[str] = None

class ChatResponse(BaseModel):
    response: str
    session_id: str

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    try:
        # Get or create session
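        if request.session_id and request.session_id in sessions:
            # Reuse the existing session (session_id must be set for the response below)
            session_id = request.session_id
            session = sessions[session_id]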
        else:
            session_id = request.session_id or str(uuid.uuid4())
            session = BasicSession(
                provider=llm,
                system_prompt=request.system_prompt or "You are a helpful assistant."
            )
            sessions[session_id] = session

        # Generate response
        response = session.generate(request.message)

        return ChatResponse(
            response=response.content,
            session_id=session_id
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.delete("/sessions/{session_id}")
async def clear_session(session_id: str):
    if session_id in sessions:
        del sessions[session_id]
        return {"message": "Session cleared"}
    raise HTTPException(status_code=404, detail="Session not found")

# Run with: uvicorn main:app --reload
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

### Gradio Web Interface

```python
import gradio as gr
from abstractcore import create_llm, BasicSession
from typing import List, Tuple

class ChatInterface:
    def __init__(self):
        self.llm = create_llm("anthropic", model="claude-haiku-4-5")
        self.session = BasicSession(
            provider=self.llm,
            system_prompt="You are a helpful AI assistant."
        )

    def chat(self, message: str, history: List[Tuple[str, str]]) -> Tuple[str, List[Tuple[str, str]]]:
        """Handle chat interaction."""
        try:
            response = self.session.generate(message)
            history.append((message, response.content))
            return "", history
        except Exception as e:
            history.append((message, f"Error: {str(e)}"))
            return "", history

    def clear(self) -> Tuple[str, List]:
        """Clear conversation history."""
        self.session.clear_history()
        return "", []

# Create interface
chat_interface = ChatInterface()

with gr.Blocks(title="AbstractCore Chat") as demo:
    gr.Markdown("# AbstractCore Chat Interface")

    chatbot = gr.Chatbot(label="Conversation", height=400)
    msg = gr.Textbox(
        label="Message",
        placeholder="Type your message here...",
        lines=2
    )

    with gr.Row():
        submit = gr.Button("Send", variant="primary")
        clear = gr.Button("Clear", variant="secondary")

    # Event handlers
    msg.submit(
        chat_interface.chat,
        inputs=[msg, chatbot],
        outputs=[msg, chatbot]
    )

    submit.click(
        chat_interface.chat,
        inputs=[msg, chatbot],
        outputs=[msg, chatbot]
    )

    clear.click(
        chat_interface.clear,
        outputs=[msg, chatbot]
    )

if __name__ == "__main__":
    demo.launch(share=True)
```

### Jupyter Notebook Integration

```python
# Cell 1: Setup
from abstractcore import create_llm
from IPython.display import display, Markdown, HTML
import json

# Create LLM instance
llm = create_llm("openai", model="gpt-4o-mini")

def display_response(response, title="AI Response"):
    """Pretty display for Jupyter notebooks."""
    html = f"""
    <div style="border: 1px solid #ddd; padding: 15px; margin: 10px 0; border-radius: 5px;">
        <h4 style="color: #333; margin-top: 0;">{title}</h4>
        <p style="line-height: 1.6;">{response.content}</p>
    </div>
    """
    display(HTML(html))

print("AbstractCore setup complete!")

# Cell 2: Basic Usage
response = llm.generate("Explain quantum computing in simple terms")
display_response(response, "Quantum Computing Explanation")

# Cell 3: Structured Output
from pydantic import BaseModel
from typing import List

class LearningPlan(BaseModel):
    topic: str
    difficulty: str
    estimated_hours: int
    prerequisites: List[str]
    learning_steps: List[str]

plan = llm.generate(
    "Create a learning plan for someone who wants to learn machine learning",
    response_model=LearningPlan
)

# Display as nice table
display(HTML(f"""
<table style="border-collapse: collapse; width: 100%;">
    <tr><td><strong>Topic:</strong></td><td>{plan.topic}</td></tr>
    <tr><td><strong>Difficulty:</strong></td><td>{plan.difficulty}</td></tr>
    <tr><td><strong>Estimated Hours:</strong></td><td>{plan.estimated_hours}</td></tr>
    <tr><td><strong>Prerequisites:</strong></td><td>{', '.join(plan.prerequisites)}</td></tr>
</table>
"""))

display(Markdown("### Learning Steps:"))
for i, step in enumerate(plan.learning_steps, 1):
    display(Markdown(f"{i}. {step}"))
```

### Discord Bot Integration

```python
import discord
from discord.ext import commands
from abstractcore import create_llm, BasicSession
import asyncio

# Bot setup
intents = discord.Intents.default()
intents.message_content = True
bot = commands.Bot(command_prefix='!', intents=intents)

# LLM setup
llm = create_llm("anthropic", model="claude-haiku-4-5")
sessions = {}  # Store user sessions

@bot.event
async def on_ready():
    print(f'{bot.user} has connected to Discord!')

@bot.command(name='ask')
async def ask(ctx, *, question):
    """Ask the AI a question."""
    user_id = ctx.author.id

    # Get or create session for user
    if user_id not in sessions:
        sessions[user_id] = BasicSession(
            provider=llm,
            system_prompt="You are a helpful Discord bot assistant. Keep responses concise."
        )

    try:
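        # Show typing indicator; run the blocking call in a worker thread
        # so the bot's event loop stays responsive
        async with ctx.typing():
            response = await asyncio.to_thread(sessions[user_id].generate, question)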

        # Discord has a 2000 character limit
        content = response.content
        if len(content) > 2000:
            content = content[:1997] + "..."

        await ctx.reply(content)

    except Exception as e:
        await ctx.reply(f"Sorry, I encountered an error: {str(e)}")

@bot.command(name='clear')
async def clear_session(ctx):
    """Clear your conversation history."""
    user_id = ctx.author.id
    if user_id in sessions:
        sessions[user_id].clear_history()
        await ctx.reply("Your conversation history has been cleared!")
    else:
        await ctx.reply("You don't have an active session to clear.")

@bot.command(name='stats')
async def stats(ctx):
    """Show session statistics."""
    user_id = ctx.author.id
    if user_id in sessions:
        session = sessions[user_id]
        message_count = len(session.messages)
        await ctx.reply(f"Your session has {message_count} messages.")
    else:
        await ctx.reply("You don't have an active session.")

# Run bot (add your Discord bot token)
# bot.run('YOUR_DISCORD_BOT_TOKEN')
```
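
The `ask` command above truncates replies at Discord's 2000-character limit. If losing the tail is not acceptable, one alternative is to split the reply into several messages. The sketch below uses a hypothetical helper, `split_message`, that prefers to break at a newline or a space.

```python
from typing import List


def split_message(text: str, limit: int = 2000) -> List[str]:
    """Split text into chunks of at most `limit` characters."""
    chunks = []
    remaining = text
    while len(remaining) > limit:
        # Prefer the last newline, then the last space, within the limit.
        cut = remaining.rfind("\n", 0, limit)
        if cut <= 0:
            cut = remaining.rfind(" ", 0, limit)
        if cut <= 0:
            cut = limit  # No natural break point: hard cut.
        piece = remaining[:cut].rstrip()
        if piece:
            chunks.append(piece)
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


parts = split_message("word " * 1000)  # 5000 characters of input
print([len(p) for p in parts])
```

Inside the command, each part could then be sent in order, for example with one `await ctx.send(part)` per chunk.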

## Next Steps

These examples show AbstractCore's versatility across different use cases. To continue learning:

1. **Start with basics** - Try the simple Q&A examples
2. **Add tools** - Experiment with the tool calling examples
3. **Structure output** - Use Pydantic models for type-safe responses
4. **Go production** - Implement error handling and monitoring
5. **Build apps** - Use the integration examples as starting points

For more information:
- [Getting Started](getting-started.md) - Basic setup and usage
- [Capabilities](capabilities.md) - What AbstractCore can do
- [Prerequisites](prerequisites.md) - Provider setup and configuration
- [API Reference](api-reference.md) - Complete API documentation

---

**Remember**: All these examples work with any provider - just change the `create_llm()` call to switch between OpenAI, Anthropic, Ollama, MLX, and others!

---

### Inlined: `docs/mcp.md`

# MCP (Model Context Protocol)

AbstractCore treats MCP as a **tool-server protocol** (not an LLM provider).

The `abstractcore.mcp` module provides:
- a minimal MCP JSON-RPC client (Streamable HTTP) → `abstractcore.mcp.McpClient`
- a minimal MCP stdio client (spawn a subprocess) → `abstractcore.mcp.McpStdioClient`
- tool discovery (`tools/list`) and conversion into AbstractCore tool specs → `abstractcore.mcp.McpToolSource`

## What you can do today

### 1) Discover tools from an MCP server

```python
from abstractcore.mcp import McpClient, McpToolSource

client = McpClient(url="http://localhost:3000")  # MCP streamable HTTP endpoint
source = McpToolSource(server_id="local", client=client)
tools = source.list_tool_specs()
```

Each returned tool spec is an AbstractCore-compatible dict you can pass to `tools=[...]` in
`generate()`/`agenerate()`. Tool names are namespaced as:

`mcp::<server_id>::<tool_name>`

See `abstractcore/abstractcore/mcp/naming.py`.
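
To make the naming convention concrete, here is a small sketch that builds and splits names of the form `mcp::<server_id>::<tool_name>`. The helpers `make_name` and `split_name` are hypothetical and written only for this example; how the library treats a tool name that itself contains `::` is not stated here, so the `maxsplit` choice below is an assumption. In real code, prefer the helpers exported by `abstractcore.mcp`.

```python
from typing import Optional, Tuple

PREFIX = "mcp"
SEP = "::"


def make_name(server_id: str, tool_name: str) -> str:
    """Build a namespaced tool name."""
    return SEP.join([PREFIX, server_id, tool_name])


def split_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (server_id, tool_name), or None if `name` is not MCP-namespaced."""
    # maxsplit=2 keeps any further "::" inside the tool name (an assumption).
    parts = name.split(SEP, 2)
    if len(parts) != 3 or parts[0] != PREFIX or not parts[1] or not parts[2]:
        return None
    return parts[1], parts[2]


print(make_name("local", "read_file"))
print(split_name("mcp::local::read_file"))
print(split_name("get_weather"))
```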

### 2) Execute MCP tools in your host/runtime

AbstractCore’s default execution path is **pass-through** (`execute_tools=False`): the model can
request tool calls, and you execute them in your host/runtime.

The built-in `abstractcore.tools.registry.execute_tools()` executes Python callables registered in
the (deprecated) global registry; it does **not** automatically route MCP tool calls. For MCP, your
host/runtime should detect names starting with `mcp::` and dispatch them to an MCP client.

```python
from abstractcore.mcp import McpClient, parse_namespaced_tool_name

client = McpClient(url="http://localhost:3000")

def execute_mcp_tool_call(call: dict) -> dict:
    parsed = parse_namespaced_tool_name(call.get("name", ""))
    if not parsed:
        raise ValueError("Not an MCP tool call")
    server_id, tool_name = parsed
    return client.call_tool(name=tool_name, arguments=call.get("arguments") or {})
```
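
The function above sends every call to a single client and does not use `server_id`. With more than one MCP server, a host would typically keep a registry keyed by server ID. The sketch below shows that dispatch shape with a stand-in client class so it runs without a server. `FakeClient` and `route_tool_call` are hypothetical; only the `call_tool(name=..., arguments=...)` call shape is taken from the example above.

```python
from typing import Any, Dict


class FakeClient:
    """Stand-in for an MCP client; mirrors only call_tool(name=..., arguments=...)."""

    def __init__(self, label: str) -> None:
        self.label = label

    def call_tool(self, name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        return {"server": self.label, "tool": name, "arguments": arguments}


def route_tool_call(call: Dict[str, Any], clients: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch an `mcp::<server_id>::<tool_name>` call to the matching client."""
    parts = str(call.get("name", "")).split("::", 2)
    if len(parts) != 3 or parts[0] != "mcp":
        raise ValueError("Not an MCP tool call")
    _, server_id, tool_name = parts
    if server_id not in clients:
        raise KeyError(f"Unknown MCP server: {server_id}")
    return clients[server_id].call_tool(name=tool_name, arguments=call.get("arguments") or {})


clients = {"local": FakeClient("local"), "search": FakeClient("search")}
result = route_tool_call({"name": "mcp::search::web_search", "arguments": {"q": "mcp"}}, clients)
print(result)
```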

## Transports supported

### Streamable HTTP

`McpClient` posts JSON-RPC to the server URL. It automatically sets an `Accept` header compatible
with streamable HTTP (`application/json, text/event-stream`) and captures the `MCP-Session-Id`
response header when the server provides one.

See `abstractcore/abstractcore/mcp/client.py`.
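
For orientation, the request described above can be pictured as follows. This sketch only builds the JSON-RPC 2.0 body and headers for a `tools/list` call and sends nothing. `build_request` is a hypothetical helper, and any header other than `Accept` (including how the session ID is sent back) is an assumption rather than a description of `McpClient`.

```python
import json
from typing import Any, Dict, Optional, Tuple


def build_request(
    method: str,
    params: Optional[Dict[str, Any]] = None,
    request_id: int = 1,
    session_id: Optional[str] = None,
) -> Tuple[Dict[str, str], bytes]:
    """Return (headers, body) for a JSON-RPC 2.0 call over streamable HTTP."""
    headers = {
        "Content-Type": "application/json",
        # The server may answer with plain JSON or with an SSE stream.
        "Accept": "application/json, text/event-stream",
    }
    if session_id:
        # Assumption: echo back the session ID captured from an earlier response.
        headers["MCP-Session-Id"] = session_id
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or {}}
    return headers, json.dumps(payload).encode("utf-8")


headers, body = build_request("tools/list", session_id="abc123")
print(headers["Accept"])
print(body.decode("utf-8"))
```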

### stdio

`McpStdioClient` spawns an MCP server subprocess and communicates over stdin/stdout with JSON-RPC,
including a best-effort initialization handshake.

See `abstractcore/abstractcore/mcp/stdio_client.py`.

## Configuration helpers

`create_mcp_client(config=...)` supports both HTTP and stdio config shapes:

```python
from abstractcore.mcp import create_mcp_client

client = create_mcp_client(config={"url": "http://localhost:3000"})
client = create_mcp_client(config={"transport": "stdio", "command": ["my-mcp-server", "--stdio"]})
```

See `abstractcore/abstractcore/mcp/factory.py`.

## Current limitations

- MCP is currently a **library-level** integration (tool discovery + clients). AbstractCore’s HTTP
  server does not expose MCP management endpoints.
- Tool execution routing for `mcp::...` names is host/runtime responsibility.

---

### Inlined: `docs/structured-logging.md`

# Structured Logging

AbstractCore uses Python logging throughout the library. You can control console verbosity and optional file logging via the centralized config CLI.

Default behavior (no overrides): **console shows only ERROR and above**.

## Configure with the CLI

```bash
# Show current config (including logging)
abstractcore --status

# Console verbosity
abstractcore --set-console-log-level DEBUG
abstractcore --set-console-log-level INFO
abstractcore --set-console-log-level WARNING
abstractcore --set-console-log-level ERROR
abstractcore --set-console-log-level NONE

# File logging (disabled by default)
abstractcore --enable-file-logging
abstractcore --disable-file-logging
abstractcore --set-log-base-dir ~/.abstractcore/logs

# Convenience
abstractcore --enable-debug-logging
abstractcore --disable-console-logging
```

Logging defaults live in `~/.abstractcore/config/abstractcore.json`. See [Centralized Config](centralized-config.md) for the schema.

## Verbatim capture (prompts/responses)

Some components can capture full prompts and responses in logs/traces. This is controlled by `verbatim_enabled` in the centralized config file (`~/.abstractcore/config/abstractcore.json`). Disable it if you may handle sensitive data.

## In-code usage

```python
from abstractcore.utils.structured_logging import get_logger

logger = get_logger(__name__)
logger.info("startup", component="my_app", version="1.0.0")
```

---

### Inlined: `docs/api-reference.md`

# API Reference

Complete reference for the AbstractCore API. All examples work across any provider.

## Table of Contents

- [Core Functions](#core-functions)
- [Classes](#classes)
  - [AbstractCoreInterface](#abstractcoreinterface)
    - [generate()](#generate)
    - [agenerate()](#agenerate)
  - [BasicSession](#basicsession)
    - [generate()](#generate-1)
    - [agenerate()](#agenerate-1)
- [Event System](#event-system)
- [Retry Configuration](#retry-configuration)
- [Embeddings](#embeddings)
- [Exceptions](#exceptions)

## Core Functions

### create_llm()

Creates an LLM provider instance.

```python
def create_llm(
    provider: str,
    model: Optional[str] = None,
    retry_config: Optional[RetryConfig] = None,
    **kwargs
) -> AbstractCoreInterface
```

**Parameters:**
- `provider` (str): Provider name ("openai", "anthropic", "ollama", "mlx", "lmstudio", "huggingface")
- `model` (str, optional): Model name. If not provided, uses provider default
- `retry_config` (RetryConfig, optional): Custom retry configuration
- `**kwargs`: Provider-specific parameters

**Provider-specific parameters:**
- `api_key` (str): API key for cloud providers
- `base_url` (str): Custom endpoint URL
- `temperature` (float): Sampling temperature (0.0-1.0, controls creativity)
- `seed` (int): Random seed for deterministic outputs (✅ OpenAI, Ollama, MLX, HuggingFace, LMStudio; ⚠️ Anthropic issues warning)
- `max_tokens` (int): Maximum output tokens
- `timeout` (int): Request timeout in seconds
- `top_p` (float): Nucleus sampling parameter

**Returns:** AbstractCoreInterface instance

**Example:**
```python
from abstractcore import create_llm

# Basic usage
llm = create_llm("openai", model="gpt-4o-mini")

# With configuration
llm = create_llm(
    "anthropic",
    model="claude-haiku-4-5",
    temperature=0.7,
    max_tokens=1000,
    timeout=30
)

# Local provider
llm = create_llm("ollama", model="qwen2.5-coder:7b", base_url="http://localhost:11434")
```

## Classes

### AbstractCoreInterface

Base interface for all LLM providers. All providers implement this interface.

#### generate()

Generate text response from the LLM.

```python
def generate(
    self,
    prompt: str,
    messages: Optional[List[Dict]] = None,
    system_prompt: Optional[str] = None,
    tools: Optional[List[Dict]] = None,
    response_model: Optional[BaseModel] = None,
    retry_strategy: Optional[Retry] = None,
    stream: bool = False,
    thinking: Optional[bool | str] = None,
    **kwargs
) -> Union[GenerateResponse, Iterator[GenerateResponse]]
```

**Parameters:**
- `prompt` (str): Text prompt to generate from
- `messages` (List[Dict], optional): Conversation messages in OpenAI format
- `system_prompt` (str, optional): System prompt to set context
- `tools` (List[Dict], optional): Tools the LLM can call
- `response_model` (BaseModel, optional): Pydantic model for structured output
- `retry_strategy` (Retry, optional): Custom retry strategy for structured output
- `stream` (bool): Enable streaming response
- `thinking` (bool | str, optional): Unified thinking/reasoning control (`"auto"|"on"|"off"|"none"` or `"low"|"medium"|"high"|"xhigh"` when supported). Note: `"none"` is treated as an alias for `"off"`.
- `**kwargs`: Additional generation parameters

**Returns:**
- If `stream=False`: GenerateResponse
- If `stream=True`: Iterator[GenerateResponse]

**Examples:**

**Basic Generation:**
```python
response = llm.generate("What is machine learning?")
print(response.content)
```

**With System Prompt:**
```python
response = llm.generate(
    "Explain Python decorators",
    system_prompt="You are a Python expert. Always provide code examples."
)
```

**Structured Output:**
```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

person = llm.generate(
    "Extract: John Doe is 25 years old",
    response_model=Person
)
print(f"{person.name}, age {person.age}")
```

> **See**: [Structured Output Guide](structured-output.md) for comprehensive documentation

**Tool Calling:**
```python
def get_weather(city: str) -> str:
    return f"Weather in {city}: sunny, 22°C"

tools = [{
    "name": "get_weather",
    "description": "Get weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
    }
}]

response = llm.generate("What's the weather in Paris?", tools=tools)
```
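
Because execution is pass-through by default, the host runs the requested tools itself. The sketch below shows that step, assuming each entry in `response.tool_calls` is a dict with `name` and `arguments` keys. Handling `arguments` that arrive as a JSON string is a defensive assumption. `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names for this example.

```python
import json
from typing import Any, Callable, Dict, List, Optional


def get_weather(city: str) -> str:
    return f"Weather in {city}: sunny, 22°C"


# Map tool names to the Python callables that implement them.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(tool_calls: Optional[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Execute each requested tool and collect results; errors are captured, not raised."""
    results = []
    for call in tool_calls or []:
        name = call.get("name", "")
        arguments = call.get("arguments") or {}
        if isinstance(arguments, str):
            arguments = json.loads(arguments)
        func = TOOL_REGISTRY.get(name)
        if func is None:
            results.append({"name": name, "error": f"Unknown tool: {name}"})
            continue
        try:
            results.append({"name": name, "output": func(**arguments)})
        except Exception as exc:
            results.append({"name": name, "error": str(exc)})
    return results


# Stand-in for response.tool_calls
results = run_tool_calls([{"name": "get_weather", "arguments": {"city": "Paris"}}])
print(results)
```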

**Streaming:**
```python
print("AI: ", end="")
for chunk in llm.generate(
    "Create a Python function with a tool",
    stream=True,
    tools=tools
):
    # Real-time chunk processing
    print(chunk.content or "", end="", flush=True)

    # Tool calls are surfaced as structured dicts; execute them in your host/runtime.
    if chunk.tool_calls:
        print(f"\nTool calls: {chunk.tool_calls}")
```

**Streaming notes**:
- Streaming uses a unified processor across providers; exact chunking behavior depends on the backend.
- Tool calls are surfaced as structured dicts in `chunk.tool_calls`; execute them in your host/runtime (pass-through by default).
- If you need tool-call markup preserved/re-written in `chunk.content`, pass `tool_call_tags=...` (see [Tool Call Syntax Rewriting](tool-syntax-rewriting.md)).
- In streaming mode, AbstractCore records a best-effort TTFT metric in `chunk.metadata["_timing"]["ttft_ms"]` when available (for debugging/observability).
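
Since the TTFT metric is best-effort, code that reads it should tolerate missing keys. The sketch below uses `SimpleNamespace` objects as stand-ins for streamed chunks; `first_ttft_ms` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def first_ttft_ms(chunks: Iterable[Any]) -> Optional[float]:
    """Return the first TTFT value found in chunk metadata, or None if absent."""
    for chunk in chunks:
        metadata = getattr(chunk, "metadata", None) or {}
        ttft = (metadata.get("_timing") or {}).get("ttft_ms")
        if ttft is not None:
            return float(ttft)
    return None


# Stand-ins for streamed chunks: the first has no metadata at all.
fake_stream = [
    SimpleNamespace(content="Hel", metadata=None),
    SimpleNamespace(content="lo", metadata={"_timing": {"ttft_ms": 182.5}}),
]
print(first_ttft_ms(fake_stream))
print(first_ttft_ms([SimpleNamespace(content="x", metadata={})]))
```

Note that this helper consumes the iterable it is given. With a live stream, the same lookup would usually sit inside the loop that prints `chunk.content`.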

#### agenerate()

Async version of `generate()` for concurrent request execution.

```python
async def agenerate(
    self,
    prompt: str,
    messages: Optional[List[Dict]] = None,
    system_prompt: Optional[str] = None,
    tools: Optional[List[Dict]] = None,
    response_model: Optional[BaseModel] = None,
    stream: bool = False,
    **kwargs
) -> Union[GenerateResponse, AsyncIterator[GenerateResponse]]
```

**Parameters:** Same as `generate()`

**Returns:**
- If `stream=False`: GenerateResponse
- If `stream=True`: AsyncIterator[GenerateResponse]

**Examples:**

**Basic Async:**
```python
import asyncio

async def main():
    response = await llm.agenerate("What is quantum computing?")
    print(response.content)

asyncio.run(main())
```

**Concurrent Requests:**
```python
async def batch_process():
    tasks = [
        llm.agenerate("Summarize Python"),
        llm.agenerate("Summarize JavaScript"),
        llm.agenerate("Summarize Rust")
    ]
    responses = await asyncio.gather(*tasks)

    for response in responses:
        print(response.content)

asyncio.run(batch_process())
```

**Async Streaming:**
```python
async def stream_response():
    # agenerate() is a coroutine: await it to obtain the async iterator
    async for chunk in await llm.agenerate("Tell me a story", stream=True):
        print(chunk.content or "", end='', flush=True)

asyncio.run(stream_response())
```

**Multi-Provider Comparison:**
```python
async def compare_providers():
    openai = create_llm("openai", model="gpt-4o-mini")
    claude = create_llm("anthropic", model="claude-haiku-4-5")

    responses = await asyncio.gather(
        openai.agenerate("What is 2+2?"),
        claude.agenerate("What is 2+2?")
    )

    print(f"OpenAI: {responses[0].content}")
    print(f"Claude: {responses[1].content}")

asyncio.run(compare_providers())
```

**Features:**
- Works across AbstractCore providers (cloud + local); some use native async, others fall back to `asyncio.to_thread()`
- Faster batch operations via concurrent execution (depends on provider, network, and hardware)
- Full streaming support with AsyncIterator
- Compatible with FastAPI and async web frameworks
- Zero breaking changes to sync API
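
The `asyncio.to_thread()` fallback mentioned above can be pictured with a small sketch: a blocking `generate()` runs in a worker thread so that several calls overlap. `SlowSyncProvider` is a hypothetical stand-in, not an AbstractCore class, and the sleep only simulates latency.

```python
import asyncio
import time
from typing import List


class SlowSyncProvider:
    """Stand-in provider whose only native method is a blocking generate()."""

    def generate(self, prompt: str) -> str:
        time.sleep(0.2)  # Simulate network latency.
        return f"echo: {prompt}"

    async def agenerate(self, prompt: str) -> str:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.generate, prompt)


async def main() -> List[str]:
    provider = SlowSyncProvider()
    # gather() preserves the order of the awaitables it was given.
    return await asyncio.gather(*(provider.agenerate(p) for p in ["a", "b", "c"]))


start = time.perf_counter()
answers = asyncio.run(main())
elapsed = time.perf_counter() - start
print(answers, f"{elapsed:.2f}s")
```

The three calls run in separate threads, so the total time should be close to one call's latency instead of three. Thread-based fallbacks still occupy a worker thread per request, which is why native async support is preferable where a provider offers it.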

#### get_capabilities()

Get provider capabilities.

```python
def get_capabilities(self) -> List[str]
```

**Returns:** List of capability strings

**Example:**
```python
capabilities = llm.get_capabilities()
print(capabilities)  # ['text_generation', 'tool_calling', 'streaming', 'vision']
```

#### unload_model(model_name)

Unload/cleanup resources for a specific model (best-effort).

```python
def unload_model(self, model_name: str) -> None
```

For local providers (Ollama, MLX, HuggingFace, LMStudio), this explicitly frees model memory or releases client resources. For API providers (OpenAI, Anthropic), this is typically a no-op but safe to call.

**Provider-specific behavior:**
- **Ollama**: Uses native `keep_alive` load/unload semantics and `/api/ps` for residency truth
- **MLX**: Clears model/tokenizer references and forces garbage collection
- **HuggingFace**: Closes llama.cpp resources (GGUF) or clears model references
- **LMStudio**: Uses native loaded-instance REST load/unload when available
- **OpenAI/Anthropic**: No-op (safe to call)

`get_model_residency(...)` reports verified loaded state only when the provider can
check the backing runtime. Client construction, configured defaults, and model catalogs
are not treated as loaded-model proof.

**Example:**
```python
# Load and use a large model
llm = create_llm("ollama", model="qwen3-coder:30b")
response = llm.generate("Hello world")

# Explicitly free memory when done
llm.unload_model(llm.model)
del llm

# Now safe to load another large model
llm2 = create_llm("mlx", model="mlx-community/Qwen3-30B-4bit")
```

**Use cases:**
- Test suites that exercise multiple models sequentially
- Memory-constrained environments (<32GB RAM)
- Sequential model loading in production systems

### GenerateResponse

Response object from LLM generation with **consistent token terminology** and **generation time tracking**.

```python
@dataclass
class GenerateResponse:
    content: Optional[str]
    raw_response: Any
    model: Optional[str]
    finish_reason: Optional[str]
    usage: Optional[Dict[str, int]]
    tool_calls: Optional[List[Dict]]
    metadata: Optional[Dict]
    gen_time: Optional[float]  # Generation time in milliseconds
    
    # Consistent token access properties
    @property
    def input_tokens(self) -> Optional[int]:
        """Get input tokens with consistent terminology."""
        
    @property
    def output_tokens(self) -> Optional[int]:
        """Get output tokens with consistent terminology."""
        
    @property
    def total_tokens(self) -> Optional[int]:
        """Get total tokens."""
```

**Attributes:**
- `content` (str): Generated text content
- `raw_response` (Any): Raw provider response
- `model` (str): Model used for generation
- `finish_reason` (str): Why generation stopped ("stop", "length", "tool_calls")
- `usage` (Dict): Token usage information
- `tool_calls` (List[Dict]): Tools called by the LLM
- `metadata` (Dict): Additional metadata (notably `metadata["reasoning"]` when a provider/model exposes thinking/reasoning)
- `gen_time` (float): Generation time in milliseconds, rounded to 1 decimal place

**Token and Timing Access Examples:**
```python
response = llm.generate("Explain quantum computing")

# Best-effort access across supported providers (may be None depending on backend/config)
print(f"Input tokens: {response.input_tokens}")      # None if usage isn't reported/estimated
print(f"Output tokens: {response.output_tokens}")    # None if usage isn't reported/estimated
print(f"Total tokens: {response.total_tokens}")      # None if usage isn't reported/estimated
print(f"Generation time: {response.gen_time}ms")     # None if timing wasn't captured

# Comprehensive summary
print(f"Summary: {response.get_summary()}")  # Model | Tokens | Time | Tools

# Raw usage dictionary (provider-specific format)
print(f"Usage details: {response.usage}")
```

**Token Count Sources:**
- **Provider APIs**: OpenAI, Anthropic, LMStudio (native API token counts)
- **AbstractCore Calculation**: MLX, HuggingFace (using `token_utils.py`)
- **Mixed Sources**: Ollama (combination of provider and calculated tokens)

**Backward Compatibility**: Legacy `prompt_tokens` and `completion_tokens` keys remain available in `response.usage` dictionary.
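
One plausible way to reconcile the new property names with the legacy keys is sketched below. `UsageView` is a hypothetical helper written for this example, and the `input_tokens` / `output_tokens` / `total_tokens` dictionary keys are assumptions; only `prompt_tokens` and `completion_tokens` are named above. The actual property implementation may differ.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UsageView:
    """Hypothetical helper (not part of AbstractCore) that normalizes usage keys."""

    usage: Optional[Dict[str, int]] = None

    def _first(self, *keys: str) -> Optional[int]:
        # Return the value of the first key present, or None if usage is missing.
        if not self.usage:
            return None
        for key in keys:
            if key in self.usage:
                return self.usage[key]
        return None

    @property
    def input_tokens(self) -> Optional[int]:
        # Assumed new-style key first, then the legacy key.
        return self._first("input_tokens", "prompt_tokens")

    @property
    def output_tokens(self) -> Optional[int]:
        return self._first("output_tokens", "completion_tokens")

    @property
    def total_tokens(self) -> Optional[int]:
        total = self._first("total_tokens")
        if total is not None:
            return total
        if self.input_tokens is None or self.output_tokens is None:
            return None
        return self.input_tokens + self.output_tokens


legacy = UsageView({"prompt_tokens": 12, "completion_tokens": 30})
print(legacy.input_tokens, legacy.output_tokens, legacy.total_tokens)
```

Returning `None` rather than `0` when usage is absent keeps "not reported" distinguishable from "zero tokens".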

**Methods:**

#### has_tool_calls()
```python
def has_tool_calls(self) -> bool
```
Returns True if the response contains tool calls.

#### get_tools_executed()
```python
def get_tools_executed(self) -> List[str]
```
Returns list of tool names that were executed.

**Example:**
```python
response = llm.generate("What's 2+2?", tools=[calculator_tool])

print(f"Content: {response.content}")
print(f"Model: {response.model}")
print(f"Tokens: {response.usage}")

if response.has_tool_calls():
    print(f"Tools used: {response.get_tools_executed()}")
```
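
When tool execution is left to the host, the entries in `response.tool_calls` have to be dispatched by the caller. The sketch below shows one way to do that. It assumes each tool call is a dict with `"name"` and `"arguments"` keys, which is not confirmed by this reference; `TOOLS`, `add`, and `run_tool_calls` are hypothetical names for illustration.

```python
from typing import Any, Callable, Dict, List, Optional


def add(a: float, b: float) -> float:
    """Example tool used only in this sketch."""
    return a + b


# Host-side registry mapping tool names to Python callables (hypothetical).
TOOLS: Dict[str, Callable[..., Any]] = {"add": add}


def run_tool_calls(tool_calls: Optional[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Execute each call and collect results; assumes 'name' and 'arguments' keys."""
    results = []
    for call in tool_calls or []:
        name = call.get("name")
        func = TOOLS.get(name)
        if func is None:
            results.append({"name": name, "success": False, "error": "unknown tool"})
            continue
        try:
            output = func(**(call.get("arguments") or {}))
            results.append({"name": name, "success": True, "output": output})
        except Exception as exc:  # report the failure instead of aborting the batch
            results.append({"name": name, "success": False, "error": str(exc)})
    return results


outcome = run_tool_calls([
    {"name": "add", "arguments": {"a": 2, "b": 2}},
    {"name": "missing", "arguments": {}},
])
print(outcome)
```

Unknown tools and raised exceptions are recorded as failed results so one bad call does not stop the remaining ones.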

### BasicSession

Manages conversation context and history.

```python
class BasicSession:
    def __init__(
        self,
        provider: AbstractCoreInterface,
        system_prompt: Optional[str] = None,
        temperature: Optional[float] = None,
        seed: Optional[int] = None,
        **kwargs
    ):
```

**Parameters:**
- `provider` (AbstractCoreInterface): LLM provider instance
- `system_prompt` (str, optional): System prompt for the conversation
- `temperature` (float, optional): Default temperature for all generations (0.0-1.0)
- `seed` (int, optional): Default seed for deterministic outputs (provider support varies)
- `**kwargs`: Additional session parameters (tools, timeouts, etc.)

**Attributes:**
- `messages` (List[Message]): Conversation history
- `provider` (AbstractCoreInterface): LLM provider
- `system_prompt` (str): System prompt

**Methods:**

#### generate()
```python
def generate(self, prompt: str, **kwargs) -> GenerateResponse
```
Generate response and add to conversation history.

#### agenerate()
```python
async def agenerate(
    self,
    prompt: str,
    name: Optional[str] = None,
    location: Optional[str] = None,
    **kwargs
) -> Union[GenerateResponse, AsyncIterator[GenerateResponse]]
```
Async version of `generate()`. Maintains conversation history with async execution.

**Example:**
```python
import asyncio

async def chat():
    session = BasicSession(provider=llm)

    # Async conversation
    response1 = await session.agenerate("My name is Alice")
    response2 = await session.agenerate("What's my name?")

    print(response2.content)  # References Alice

asyncio.run(chat())
```

#### add_message()
```python
def add_message(self, role: str, content: str, **metadata) -> Message
```
Add message to conversation history.

#### clear_history()
```python
def clear_history(self, keep_system: bool = True) -> None
```
Clear conversation history, optionally keeping system prompt.

#### save()
```python
def save(self, filepath: Path) -> None
```
Save session to JSON file.

#### load()
```python
@classmethod
def load(cls, filepath: Path, provider: AbstractCoreInterface) -> "BasicSession"
```
Load session from JSON file.

**Example:**
```python
from pathlib import Path

from abstractcore import create_llm, BasicSession

llm = create_llm("openai", model="gpt-4o-mini")
session = BasicSession(
    provider=llm,
    system_prompt="You are a helpful coding tutor.",
    temperature=0.3,  # Focused responses
    seed=42          # Consistent outputs
)

# Multi-turn conversation
response1 = session.generate("What are Python decorators?")
response2 = session.generate("Show me an example", temperature=0.7)  # Override for this call

print(f"Conversation has {len(session.messages)} messages")

# Save session
session.save(Path("conversation.json"))

# Load later
loaded_session = BasicSession.load(Path("conversation.json"), llm)
```

### Message

Represents a conversation message.

```python
@dataclass
class Message:
    role: str
    content: str
    timestamp: Optional[datetime] = None
    name: Optional[str] = None
    metadata: Optional[Dict] = None
```

**Methods:**

#### to_dict()
```python
def to_dict(self) -> Dict
```
Convert message to dictionary.

#### from_dict()
```python
@classmethod
def from_dict(cls, data: Dict) -> "Message"
```
Create message from dictionary.
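
A dictionary round trip for a message with these fields might look like the sketch below. `MessageSketch` is a hypothetical re-implementation for illustration; storing the timestamp as an ISO 8601 string is an assumption, not the documented on-disk format.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class MessageSketch:
    """Hypothetical mirror of the Message fields shown above."""

    role: str
    content: str
    timestamp: Optional[datetime] = None
    name: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None

    def to_dict(self) -> Dict[str, Any]:
        return {
            "role": self.role,
            "content": self.content,
            # Assumption: timestamps are stored as ISO 8601 strings.
            "timestamp": self.timestamp.isoformat() if self.timestamp else None,
            "name": self.name,
            "metadata": self.metadata or {},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MessageSketch":
        raw_ts = data.get("timestamp")
        return cls(
            role=data["role"],
            content=data["content"],
            timestamp=datetime.fromisoformat(raw_ts) if raw_ts else None,
            name=data.get("name"),
            metadata=data.get("metadata") or None,
        )


original = MessageSketch(
    "user",
    "hi",
    datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    metadata={"lang": "en"},
)
# Pass through JSON text to confirm the dict contains only serializable values.
restored = MessageSketch.from_dict(json.loads(json.dumps(original.to_dict())))
print(restored == original)
```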

## Event System

### EventType

Available event types for monitoring.

```python
class EventType(Enum):
    # Generation events
    GENERATION_STARTED = "generation_started"
    GENERATION_COMPLETED = "generation_completed"

    # Tool events
    TOOL_STARTED = "tool_started"
    TOOL_PROGRESS = "tool_progress"
    TOOL_COMPLETED = "tool_completed"

    # Error handling
    ERROR = "error"

    # Retry and resilience events
    RETRY_ATTEMPTED = "retry_attempted"
    RETRY_EXHAUSTED = "retry_exhausted"

    # Useful events
    VALIDATION_FAILED = "validation_failed"
    SESSION_CREATED = "session_created"
    SESSION_CLEARED = "session_cleared"
    COMPACTION_STARTED = "compaction_started"
    COMPACTION_COMPLETED = "compaction_completed"

    # Runtime/workflow events
    WORKFLOW_STEP_STARTED = "workflow_step_started"
    WORKFLOW_STEP_COMPLETED = "workflow_step_completed"
    WORKFLOW_STEP_WAITING = "workflow_step_waiting"
    WORKFLOW_STEP_FAILED = "workflow_step_failed"
```

### on_global()

Register global event handler.

```python
def on_global(event_type: EventType, handler: Callable[[Event], None]) -> None
```

**Parameters:**
- `event_type` (EventType): Event type to listen for
- `handler` (Callable): Function to call when event occurs

**Example:**
```python
from abstractcore import create_llm
from abstractcore.events import EventType, on_global

def cost_monitor(event):
    cost = event.data.get("cost_usd")
    if cost:
        # NOTE: `cost_usd` is a best-effort estimate based on token usage.
        print(f"Estimated cost: ${cost:.4f}")

def tool_monitor(event):
    # Tool event payload shape varies by emitter.
    # - Single-tool execution: {"tool_name": ..., "success": ..., ...}
    # - Batch execution: {"tool_results": [{"name": ..., "success": ...}, ...], ...}
    tool_name = event.data.get("tool_name")
    if tool_name:
        print(f"Tool completed: {tool_name} success={event.data.get('success')}")
        return

    for r in event.data.get("tool_results", []) or []:
        print(f"Tool completed: {r.get('name')} success={r.get('success')} error={r.get('error')}")

# Register handlers
on_global(EventType.GENERATION_COMPLETED, cost_monitor)
on_global(EventType.TOOL_COMPLETED, tool_monitor)

# Now all LLM operations will trigger these handlers
llm = create_llm("openai", model="gpt-4o-mini")
response = llm.generate("Hello world")
```

### Event

Event object passed to handlers.

```python
@dataclass
class Event:
    type: EventType
    timestamp: datetime
    data: Dict[str, Any]
    source: Optional[str] = None
```
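
To make the handler flow concrete, here is a minimal sketch of a global event registry. Every name ending in `Sketch` or `_sketch` is hypothetical and written for this example; it is not AbstractCore's implementation, and swallowing handler exceptions is a design choice of the sketch only.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Callable, DefaultDict, Dict, List, Optional


class EventTypeSketch(Enum):
    GENERATION_COMPLETED = "generation_completed"
    ERROR = "error"


@dataclass
class EventSketch:
    type: EventTypeSketch
    timestamp: datetime
    data: Dict[str, Any]
    source: Optional[str] = None


Handler = Callable[[EventSketch], None]
_handlers: DefaultDict[EventTypeSketch, List[Handler]] = defaultdict(list)


def on_global_sketch(event_type: EventTypeSketch, handler: Handler) -> None:
    """Register a handler for one event type."""
    _handlers[event_type].append(handler)


def emit_sketch(
    event_type: EventTypeSketch, data: Dict[str, Any], source: Optional[str] = None
) -> EventSketch:
    """Build an event and call every handler registered for its type."""
    event = EventSketch(event_type, datetime.now(), data, source)
    for handler in list(_handlers[event_type]):
        try:
            handler(event)
        except Exception:
            # Sketch choice: a failing observer must not break the observed operation.
            pass
    return event


seen: List[str] = []
on_global_sketch(EventTypeSketch.GENERATION_COMPLETED, lambda e: seen.append(e.data["model"]))
emit_sketch(EventTypeSketch.GENERATION_COMPLETED, {"model": "demo-model"})
emit_sketch(EventTypeSketch.ERROR, {"error": "ignored: no handler registered"})
print(seen)
```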

## Retry Configuration

### RetryConfig

Configuration for provider-level retry behavior.

```python
@dataclass
class RetryConfig:
    max_attempts: int = 3
    initial_delay: float = 1.0
    max_delay: float = 60.0
    exponential_base: float = 2.0
    use_jitter: bool = True
    failure_threshold: int = 5
    recovery_timeout: float = 60.0
    half_open_max_calls: int = 2
```

**Parameters:**
- `max_attempts` (int): Maximum retry attempts
- `initial_delay` (float): Initial delay in seconds
- `max_delay` (float): Maximum delay in seconds
- `exponential_base` (float): Base for exponential backoff
- `use_jitter` (bool): Add randomness to delays
- `failure_threshold` (int): Circuit breaker failure threshold
- `recovery_timeout` (float): Circuit breaker recovery timeout
- `half_open_max_calls` (int): Max calls in half-open state
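
The delay parameters combine into a capped exponential backoff. The sketch below shows one common formulation, assuming attempts are numbered from 1 and that jitter draws uniformly between 0 and the capped delay ("full jitter"). The reference only says jitter adds randomness, so the exact jitter formula is an assumption, and `backoff_delay` is a hypothetical helper.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    use_jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Exponential growth, capped at max_delay.
    delay = min(initial_delay * exponential_base ** (attempt - 1), max_delay)
    if use_jitter:
        # Assumption: "full jitter", a uniform draw in [0, delay].
        rng = rng or random.Random()
        delay = rng.uniform(0.0, delay)
    return delay


schedule = [backoff_delay(n, use_jitter=False) for n in (1, 2, 3, 4)]
print(schedule)
```

With the default values and jitter disabled, the delay doubles on each attempt until it reaches the 60-second cap.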

**Example:**
```python
from abstractcore import create_llm
from abstractcore.core.retry import RetryConfig

config = RetryConfig(
    max_attempts=5,
    initial_delay=2.0,
    use_jitter=True,
    failure_threshold=3
)

llm = create_llm("openai", model="gpt-4o-mini", retry_config=config)
```
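
The `failure_threshold`, `recovery_timeout`, and `half_open_max_calls` fields describe a circuit breaker. The following is a generic sketch of the closed / open / half-open state machine those names suggest. `CircuitBreakerSketch` is hypothetical and its exact transition rules are assumptions, not AbstractCore's implementation.

```python
import time
from typing import Callable


class CircuitBreakerSketch:
    """Generic circuit breaker; the clock is injectable so it can be tested."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
        half_open_max_calls: int = 2,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._half_open_calls = 0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout:
                return False  # still cooling down
            self.state = "half_open"  # timeout elapsed: allow limited probes
            self._half_open_calls = 0
        if self.state == "half_open":
            if self._half_open_calls >= self.half_open_max_calls:
                return False
            self._half_open_calls += 1
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreakerSketch(
    failure_threshold=2, recovery_timeout=10.0, half_open_max_calls=1, clock=lambda: now[0]
)
breaker.record_failure()
breaker.record_failure()
state_after_failures = breaker.state
now[0] = 11.0
probe_allowed = breaker.allow_request()
print(state_after_failures, probe_allowed, breaker.state)
```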

### FeedbackRetry

Retry strategy for structured output validation failures.

```python
class FeedbackRetry:
    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
```

**Example:**
```python
from abstractcore import create_llm
from abstractcore.structured import FeedbackRetry
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

llm = create_llm("openai", model="gpt-4o-mini")
custom_retry = FeedbackRetry(max_attempts=5)

user = llm.generate(
    "Extract user: John Doe, 25",
    response_model=User,
    retry_strategy=custom_retry
)
```
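
The idea behind a feedback retry is to send the validation error back to the model and ask again. The sketch below illustrates that loop with the standard library only. `generate_with_feedback`, the `ask` callable, and the required-keys check are hypothetical simplifications; AbstractCore validates against a Pydantic model and its prompt wording is not documented here.

```python
import json
from typing import Callable, List


def generate_with_feedback(
    ask: Callable[[str], str],
    prompt: str,
    required_keys: List[str],
    max_attempts: int = 3,
) -> dict:
    """Call `ask` until its reply is a JSON object containing all required keys."""
    current_prompt = prompt
    last_error = ""
    for _ in range(max_attempts):
        raw = ask(current_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = [key for key in required_keys if key not in data]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = str(exc)
            # Feed the error back so the next attempt can correct it.
            current_prompt = (
                f"{prompt}\n\nYour previous answer was invalid: {last_error}\n"
                "Return corrected JSON only."
            )
    raise ValueError(f"validation failed after {max_attempts} attempts: {last_error}")


# Fake model for demonstration: two bad replies, then a valid one.
replies = iter(["not json", '{"name": "John Doe"}', '{"name": "John Doe", "age": 25}'])
prompts_seen: List[str] = []


def fake_model(text: str) -> str:
    prompts_seen.append(text)
    return next(replies)


result = generate_with_feedback(fake_model, "Extract user: John Doe, 25", ["name", "age"])
print(result)
```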

## Embeddings

### EmbeddingManager

Generates and caches text embeddings using embedding models such as `embeddinggemma`, `granite`, and `stella-400m`.

```python
class EmbeddingManager:
    def __init__(
        self,
        model: str = "embeddinggemma",
        backend: str = "auto",
        output_dims: Optional[int] = None,
        cache_size: int = 1000,
        cache_dir: Optional[str] = None
    ):
```

**Parameters:**
- `model` (str): Model name ("embeddinggemma", "granite", "stella-400m")
- `backend` (str): Backend ("auto", "pytorch", "onnx")
- `output_dims` (int, optional): Truncate output dimensions
- `cache_size` (int): Memory cache size
- `cache_dir` (str, optional): Disk cache directory

**Methods:**

#### embed()
```python
def embed(self, text: str) -> List[float]
```
Generate embedding for single text.

#### embed_batch()
```python
def embed_batch(self, texts: List[str]) -> List[List[float]]
```
Generate embeddings for multiple texts (more efficient).

#### compute_similarity()
```python
def compute_similarity(self, text1: str, text2: str) -> float
```
Compute cosine similarity between two texts.
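
For reference, cosine similarity between two embedding vectors is the dot product divided by the product of their norms. The pure-Python sketch below shows the formula. `cosine_similarity` is a standalone helper for illustration, and returning `0.0` for a zero vector is a convention chosen here, not documented behavior of `compute_similarity()`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention in this sketch: undefined similarity maps to 0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Values range from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction).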

**Example:**
```python
from abstractcore.embeddings import EmbeddingManager

embedder = EmbeddingManager(model="embeddinggemma")

# Single embedding
embedding = embedder.embed("Hello world")
print(f"Embedding dimension: {len(embedding)}")

# Batch embeddings
embeddings = embedder.embed_batch(["Hello", "World", "AI"])

# Similarity
similarity = embedder.compute_similarity("cat", "kitten")
print(f"Similarity: {similarity:.3f}")
```

## Exceptions

### Base Exceptions

#### AbstractCoreError
```python
class AbstractCoreError(Exception):
    """Base exception for AbstractCore."""
```

#### ProviderAPIError
```python
class ProviderAPIError(AbstractCoreError):
    """Provider API error."""
```

#### ModelNotFoundError
```python
class ModelNotFoundError(AbstractCoreError):
    """Model not found error."""
```

#### AuthenticationError
```python
class AuthenticationError(ProviderAPIError):
    """Authentication error."""
```

#### RateLimitError
```python
class RateLimitError(ProviderAPIError):
    """Rate limit error."""
```

### Usage

```python
from abstractcore.exceptions import ProviderAPIError, RateLimitError

try:
    response = llm.generate("Hello world")
except RateLimitError:
    print("Rate limited, wait and retry")
except ProviderAPIError as e:
    print(f"API error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Advanced Usage Patterns

### Custom Provider Configuration

```python
from abstractcore import create_llm
from abstractcore.core.retry import RetryConfig

# Provider with all options
llm = create_llm(
    provider="openai",
    model="gpt-4o-mini",
    api_key="your-key",
    temperature=0.7,
    max_tokens=1000,
    top_p=0.9,
    timeout=30,
    retry_config=RetryConfig(max_attempts=5)
)
```

### Multi-Provider Setup

```python
from abstractcore import create_llm

providers = {
    "fast": create_llm("openai", model="gpt-4o-mini"),
    "smart": create_llm("openai", model="gpt-4o"),
    "long_context": create_llm("anthropic", model="claude-haiku-4-5"),
    "local": create_llm("ollama", model="qwen2.5-coder:7b")
}

def route_request(prompt, task_type="general"):
    if task_type == "simple":
        return providers["fast"].generate(prompt)
    elif task_type == "complex":
        return providers["smart"].generate(prompt)
    elif len(prompt) > 50000:
        return providers["long_context"].generate(prompt)
    else:
        return providers["local"].generate(prompt)
```

### Production Monitoring

```python
from abstractcore.events import EventType, on_global
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Cost tracking
total_cost = 0.0

def production_monitor(event):
    global total_cost

    if event.type == EventType.GENERATION_COMPLETED:
        cost = event.data.get("cost_usd")
        if cost:
            # NOTE: `cost_usd` is a best-effort estimate based on token usage.
            total_cost += float(cost)
            logger.info(f"Estimated cost: ${float(cost):.4f}, Total: ${total_cost:.4f}")

        duration_ms = event.data.get("duration_ms")
        if isinstance(duration_ms, (int, float)) and duration_ms > 10_000:
            logger.warning(f"Slow request: {float(duration_ms):.0f}ms")

    elif event.type == EventType.ERROR:
        logger.error(f"Error: {event.data.get('error')}")

    elif event.type == EventType.RETRY_ATTEMPTED:
        logger.info(f"Retrying due to: {event.data.get('error_type')}")

on_global(EventType.GENERATION_COMPLETED, production_monitor)
on_global(EventType.ERROR, production_monitor)
on_global(EventType.RETRY_ATTEMPTED, production_monitor)
```

---

For more examples and use cases, see:
- [Getting Started](getting-started.md) - Basic setup and usage
- [Examples](examples.md) - Practical use cases
- [Prerequisites](prerequisites.md) - Provider setup and configuration
- [Capabilities](capabilities.md) - What AbstractCore can do
