Local Models¶
Run LLMs directly on your Android device using GGUF model files and llama.cpp. No network required, no API keys, fully private.
How It Works¶
Agora bundles llama.cpp via Android NDK (CMake). When you import a GGUF file, the model runs entirely on your device's CPU — no data leaves the device.
Requirements¶
- GGUF format models only (the standard for llama.cpp)
- Device memory: The model must fit in available RAM. As a rule of thumb:
- 1–3B parameter models: 4–6 GB RAM
- 7–8B parameter models: 6–8 GB RAM
- Storage: GGUF files range from ~500 MB (quantized small models) to 5+ GB
Warning
Local inference is CPU-intensive and much slower than cloud APIs. It's best for offline use, privacy-sensitive content, or experimentation — not for fast, high-volume chat.
Importing a Chat Model¶
- Download a GGUF model file to your device (see recommended sources below)
- Go to Settings → Provider
- Select Local as the provider
- Tap Import GGUF Model
- Select the
.gguffile from your device - Configure the model:
| Parameter | Description | Example |
|---|---|---|
| Model ID | Lowercase identifier, no spaces | qwen3-8b |
| Alias | Display name | Qwen 3 8B |
| Context Size | Maximum context window in tokens | 4096 |
| Temperature | Randomness (0.0–2.0) | 0.7 |
| Top P | Nucleus sampling threshold (0.0–1.0) | 0.9 |
| Max Tokens | Maximum generation length | 2048 |
- Tap Add
The model is imported and ready to use immediately.
Importing an Embedding Model¶
Embedding models are smaller and used for semantic search:
- Go to Settings → Conversation Search
- Tap Add Local Model
- Select a
.ggufembedding model file - Give it a name
- Tap Add
See Embedding / RAG for search setup.
Selecting the Active Model¶
After importing one or more models:
- Go to Settings → Provider → Local
- You'll see all imported models listed
- Tap the radio button next to the model you want to use
- The selected model becomes active when Local is chosen as the chat provider
Managing Local Models¶
Rename¶
Tap a model to change its alias or adjust parameters (temperature, context size, etc.).
Delete¶
Long-press a model and tap Delete. This removes the model from Agora and deletes the GGUF file from storage.
Recommended Models¶
Chat Models¶
| Model | Size | RAM Needed | Notes |
|---|---|---|---|
| Qwen 3 1.7B | ~1 GB | 3–4 GB | Good quality for its size |
| Llama 3.2 3B | ~2 GB | 4–5 GB | Solid all-around |
| Qwen 3 8B | ~5 GB | 7–8 GB | Best quality, high RAM |
Embedding Models¶
| Model | Size | Notes |
|---|---|---|
| BGE Small EN v1.5 | ~130 MB | Good English embeddings |
| BGE Small ZH v1.5 | ~130 MB | Chinese-optimized |
| Nomic Embed Text v1.5 | ~270 MB | Good multilingual |
Where to Get GGUF Files¶
- Hugging Face — search for "GGUF"
- bartowski's quantized models — wide selection, well-organized
Tip
Look for Q4_K_M quantization — it offers the best trade-off between quality and size for chat models.
Performance Tips¶
- Smaller context = faster: Start with 2048 and increase only if needed
- Lower quant = faster: Q4_K_M is faster than Q6_K or Q8
- Close other apps: Local inference needs as much RAM as possible
- Plug in: Inference is CPU-intensive, and prolonged use drains the battery