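After a curated evaluation by AI Skill Hub, TensorSharp is rated "Highly Recommended". The tool scores well on feature completeness, community activity, and ease of use, with an AI score of 8.0, and is best suited to users with some technical background.
TensorSharp is an open-source tool written in C# that focuses on running GGUF-format LLMs. As an open-source GitHub project, it has active community support and ongoing releases, its code is fully transparent and auditable, and it supports local deployment to protect data privacy. It is intended to be a stable option both for personal use and for integration into enterprise workflows.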
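Clone the repository with `git clone https://github.com/zhongkaifu/TensorSharp`, then `cd TensorSharp`. View the installation instructions with `cat README.md`; once the dependencies listed in the README are installed, the tool is ready to use.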
Show help: `tensorsharp --help`. Basic run: `tensorsharp [options] <input>`. For detailed usage, see the documentation at https://github.com/zhongkaifu/TensorSharp.
Configuration: write an example configuration with `tensorsharp --config-example > config.yml`. Common options are `output_dir: ./output`, `log_level: info`, and `workers: 4`. An environment variable overrides the configuration file: `export TENSORSHARP_CONFIG="/path/to/config.yml"`.
# TensorSharp
<p align="center"> <img src="imgs/banner_1.png" alt="TensorSharp logo" width="320"> </p>
A C# inference engine for running GGUF language models locally, including autoregressive LLMs and DiffusionGemma-style text-diffusion models. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access.
… `gemma4-assistant` draft GGUF). The draft proposes several tokens per step and the trunk verifies them in one batched forward, with the request's own sampler driving both. Opt in with `--mtp-spec` (plus `--mtp-draft-model` for Gemma 4). → Speculative decoding

---
Everything below is detailed reference. New here? The five sections above are all you need to get running.
- … `<think>` / `<|channel>thought` / `<|channel>analysis` tags (Qwen 3, Qwen 3.5/3.6-family, Gemma 4, GPT OSS, Nemotron-H)
- … `TSGgml_PagedAttentionForward`) that drives `ggml_flash_attn_ext` on Metal/CUDA. Enabled by default in `TensorSharp.Server`; opt out with `--no-continuous-batching`. See `docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md`.
- … `gemma4-assistant` draft GGUF via `--mtp-draft-model` whose draft layers attend the target's own KV cache. The draft proposes up to `--mtp-draft` tokens per step (kept while draft confidence ≥ `--mtp-pmin`) and the trunk verifies them in a single batched forward; the request's own sampler (penalties included) drives both drafting and verification, so output is identical to standard decode. Opt in with `--mtp-spec` (off by default). On ggml backends, fused multi-token-verify / draft-step kernels make it a clear win; the pure-C# `cuda` backend runs a fully GPU-resident per-op verify/draft and is also a win. CPU / MLX stay on standard decode. Env: `TS_MTP_*` (shared) and `TS_GMTP_*` (Gemma 4 tuning).
- `IBatchedPagedModel.ForwardBatch` implementations for Mistral 3, Gemma 4, GPT OSS, Qwen 3, Qwen 3.5/3.6-family, and Nemotron-H all run by default and pack N sequences into a single forward pass with paged K/V scatter and per-sequence attention via the native kernel. Each model exposes a `TS_<FAMILY>_BATCHED=0` escape hatch (e.g. `TS_GEMMA4_BATCHED=0`, `TS_QWEN35_BATCHED=0`, `TS_GPTOSS_BATCHED=0`, `TS_NEMOTRON_BATCHED=0`) to fall back to the per-sequence KV-swap path for A/B comparison or regression isolation.
- `InferenceEngine` (worker-thread scheduler + paged block pool) replaces the legacy single-request FIFO queue inside `TensorSharp.Server`. The old queue object is now a compatibility shim for status/event shapes; the engine itself handles concurrency.
- … `Forward()`. The CLI exposes `--diffusion-steps`, `--diffusion-seed`, and `--diffusion-blocks`; the Web UI streams whole-message replace events for live denoising previews and batches concurrent diffusion requests through `DiffusionBatchScheduler`.
- … `TSGgml_NemotronMamba2BatchedStepF32`, NEON SIMD + GCD parallelism) used by the batched path.
- … `qwen35moe` / `qwen3next` variants such as Qwen3.5-35B-A3B), and Nemotron-H MoE FFN layers
- … `IKvBlockCodec`) with a built-in TurboQuant (Q4 / Q8) compressed codec for paged blocks, configurable via `--paged-kv-quant-bits`
- … `<think>` reasoning and the final result) plus the KV cache hit ratio. The same cache-hit stats are surfaced through every API: `prompt_cache_hit_tokens` / `prompt_cache_hit_ratio` (Ollama), `usage.prompt_tokens_details.cached_tokens` (OpenAI), and `promptTokens` / `kvReusedTokens` / `kvReusePercent` in the Web UI SSE `done` event
- `git` and network access: the GGML/CUDA native builds clone the ggml sources from github.com/ggml-org/ggml into `ExternalProjects/ggml/` on first build (see `eng/fetch-ggml.sh` / `eng/fetch-ggml.ps1`). The clone tracks ggml's default branch (`master`); pin a different ref with `TENSORSHARP_GGML_GIT_REF`, or set `TENSORSHARP_GGML_NO_UPDATE=1` to skip the network update once cloned (offline rebuilds)
- … `libmlxc` from `TensorSharp.Backends.MLX/Native/` via `bash TensorSharp.Backends.MLX/build-native-macos.sh`
- … `ggml_cuda` or `cuda`, install an NVIDIA driver plus CUDA Toolkit 12.x or another compatible CUDA toolkit with cuBLAS
- … `ggml_cuda` or `cuda`, install an NVIDIA driver plus CUDA Toolkit 12.x or another compatible CUDA toolkit with cuBLAS

```bash
dotnet build TensorSharp.slnx
```
The native library is built automatically during the first `dotnet build` if it doesn't exist. To build it manually, change into the native project (`cd TensorSharp.GGML.Native`) and run the script for your platform:

- macOS: `bash build-macos.sh`
- Linux (CPU-only): `bash build-linux.sh`
- Linux (GGML_CUDA enabled): `bash build-linux.sh --cuda`
- Windows (CPU-only): `.\build-windows.ps1 --no-cuda`
- Windows (GGML_CUDA enabled): `.\build-windows.ps1 --cuda`

On Windows and Linux, the native build script auto-detects the visible NVIDIA GPU compute capability and passes a narrow CMAKE_CUDA_ARCHITECTURES value to ggml-cuda (for example 86-real on an RTX 3080), which cuts CUDA build time substantially. The native build also runs in parallel by default with a conservative job cap so nvcc does not overwhelm typical developer machines.
If you want to override the auto-detected architecture list or the default build parallelism, use either environment variables or explicit build flags:

- Linux, architectures via environment variable: `TENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES='86-real;89-real' bash build-linux.sh --cuda`
- Linux, architectures via flag: `bash build-linux.sh --cuda --cuda-arch='86-real;89-real'`
- Linux, build parallelism: `TENSORSHARP_GGML_NATIVE_BUILD_PARALLEL_LEVEL=2 bash build-linux.sh --cuda`
- Windows, architectures via environment variable: `$env:TENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES='86-real;89-real'; .\build-windows.ps1 --cuda`
- Windows, architectures via flag: `.\build-windows.ps1 --cuda --cuda-arch='86-real;89-real'`
- Windows, build parallelism: `$env:TENSORSHARP_GGML_NATIVE_BUILD_PARALLEL_LEVEL=2; .\build-windows.ps1 --cuda`

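
To make the architecture value concrete, here is a minimal sketch of how a compute capability such as 8.6 maps to an entry such as `86-real`. Both `cap_to_arch` and `detect_cuda_archs` are hypothetical helpers written for illustration, not part of the TensorSharp build scripts, and the sketch assumes a driver whose `nvidia-smi` supports the `compute_cap` query field.

```bash
# Hypothetical helper (not part of TensorSharp's build scripts):
# turn a compute capability such as "8.6" into an architecture entry such as "86-real".
cap_to_arch() {
  printf '%s-real\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Join the unique capabilities of all visible GPUs with ';'.
# Assumption: nvidia-smi is installed and supports --query-gpu=compute_cap.
detect_cuda_archs() {
  nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
    | sort -u \
    | while read -r cap; do cap_to_arch "$cap"; done \
    | paste -sd ';' -
}
```

The result could then be passed through the documented override, for example `TENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES="$(detect_cuda_archs)" bash build-linux.sh --cuda`, although the build scripts already perform their own detection.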
You can also request a CUDA-enabled native build from `dotnet build`:

- bash: `TENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release`
- PowerShell: `$env:TENSORSHARP_GGML_NATIVE_ENABLE_CUDA='ON'; dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release`

On macOS this compiles libGgmlOps.dylib with Metal GPU support. On Windows and Linux, the native scripts preserve an existing CUDA-enabled build and auto-enable GGML_CUDA when a CUDA toolchain is detected; build-windows.ps1 --cuda, build-linux.sh --cuda, and TENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON force CUDA explicitly. The build output is automatically copied to the application's output directory.
The direct cuda backend is built as managed C# plus PTX kernels. During dotnet build, TensorSharp.Backends.Cuda compiles native/kernels/*.cu to native/ptx/*.ptx when nvcc is available; if nvcc is missing, the build continues and PTX-backed ops use CPU fallbacks. cuBLAS-backed GEMM still requires the CUDA runtime libraries to be discoverable at run time.
The MLX backend depends on `libmlxc` (the C bindings for MLX). The repository pins a known-good tag of mlx-c in `TensorSharp.Backends.MLX/Native/MLX_C_VERSION`, and a helper script fetches and builds it: `bash TensorSharp.Backends.MLX/build-native-macos.sh`
The script writes the resulting libraries (libmlxc.dylib, libmlx.dylib, and any backend deps) into TensorSharp.Backends.MLX/Native/dist/. At run time the backend probes the application directory first; you can also point it to a custom install with TENSORSHARP_MLX_LIBRARY=<path-to-libmlxc.dylib> or TENSORSHARP_MLX_LIBRARY_DIR=<dir-with-libmlxc>. If the library cannot be located the backend reports unavailable and --backend mlx is rejected at startup.
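
The lookup described above can be sketched as a small shell function for checking a deployment before launch. `find_libmlxc` is a hypothetical helper; the text only states that the application directory is probed first, so the order of the two environment variables after that is an assumption.

```bash
# Hypothetical pre-flight check, not TensorSharp code.
# Assumption: after the application directory, TENSORSHARP_MLX_LIBRARY is tried
# before TENSORSHARP_MLX_LIBRARY_DIR; the real backend may order these differently.
find_libmlxc() {
  app_dir="$1"
  if [ -f "$app_dir/libmlxc.dylib" ]; then
    printf '%s\n' "$app_dir/libmlxc.dylib"
  elif [ -n "${TENSORSHARP_MLX_LIBRARY:-}" ] && [ -f "$TENSORSHARP_MLX_LIBRARY" ]; then
    printf '%s\n' "$TENSORSHARP_MLX_LIBRARY"
  elif [ -n "${TENSORSHARP_MLX_LIBRARY_DIR:-}" ] && [ -f "$TENSORSHARP_MLX_LIBRARY_DIR/libmlxc.dylib" ]; then
    printf '%s\n' "$TENSORSHARP_MLX_LIBRARY_DIR/libmlxc.dylib"
  else
    echo "libmlxc not found; --backend mlx would be rejected at startup" >&2
    return 1
  fi
}
```

Calling `find_libmlxc /path/to/app` prints the first match and returns non-zero when nothing is found, which mirrors the "backend reports unavailable" case.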
Zero to a streaming reply in about 30 seconds (after the model download).
1. Prerequisites — .NET 10 SDK, git, and (optionally) a GPU toolchain: NVIDIA → CUDA Toolkit 12.x; Apple Silicon → Xcode command-line tools (Metal is built in). Full list in Prerequisites.
2. Clone & build — the native GGML library is compiled automatically on the first build.
git clone https://github.com/zhongkaifu/TensorSharp.git
cd TensorSharp
dotnet build TensorSharp.slnx -c Release
3. Download a model — a small, well-tested starting point is Gemma-4-E4B (Q8_0) from ggml-org/gemma-4-E4B-it-GGUF. More options in Verified Models.
4. Run it — choose the --backend for your hardware (see Pick a Backend):
```bash
```
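
As a rough illustration of step 4, the sketch below chooses between two backend names that appear in this document (`ggml_metal` and `ggml_cuda`) and prints a CLI invocation using the `--model`, `--backend`, and `--max-tokens` flags shown elsewhere on this page. `pick_backend` is a hypothetical helper, and the binary location depends on your build output, so the command is printed, not executed.

```bash
# Hypothetical helper: map the host to a backend name used in this document.
# Other backends (for example mlx or cuda) are covered under "Pick a Backend".
pick_backend() {
  os="$1"          # e.g. the output of `uname -s`
  has_nvidia="$2"  # non-empty when nvidia-smi is on PATH
  if [ "$os" = "Darwin" ]; then
    echo "ggml_metal"
  elif [ -n "$has_nvidia" ]; then
    echo "ggml_cuda"
  else
    echo "no GPU backend detected; see Pick a Backend" >&2
    return 1
  fi
}

backend="$(pick_backend "$(uname -s)" "$(command -v nvidia-smi || true)")" || backend=""
if [ -n "$backend" ]; then
  # <model.gguf> is a placeholder for the model file downloaded in step 3.
  echo "./TensorSharp.Cli --model <model.gguf> --backend $backend --max-tokens 64"
fi
```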
The CLI binary lands in TensorSharp.Cli/bin/... and the server in TensorSharp.Server/bin/... after the build. Full options: CLI usage · Server usage.
The repository is now split along package boundaries so consumers can depend on only the layers they actually need.
| Project | NuGet package | Public namespace | Responsibility |
|---|---|---|---|
| TensorSharp.Core | TensorSharp.Core | TensorSharp | Tensor primitives, ops, allocators, storage, and device abstraction |
| TensorSharp.Runtime | TensorSharp.Runtime | TensorSharp.Runtime | GGUF parsing, tokenizers, prompt rendering, sampling, output protocol parsing, paged KV cache, continuous-batching scheduler |
| TensorSharp.Models | TensorSharp.Models | TensorSharp.Models | ModelBase, architecture implementations, multimodal encoders, batched / paged forward passes, and model-side execution helpers |
| TensorSharp.Backends.GGML | TensorSharp.Backends.GGML | TensorSharp.GGML | GGML-backed execution and native interop |
| TensorSharp.Backends.Cuda | TensorSharp.Backends.Cuda | TensorSharp.Cuda | Direct CUDA allocator, storage, cuBLAS GEMM, PTX kernels, and quantized CUDA ops |
| TensorSharp.Backends.MLX | TensorSharp.Backends.MLX | TensorSharp.MLX | Apple Silicon MLX backend (mlx-c / Metal) with quantized / fused / compiled kernels and MoE expert offload |
| TensorSharp.Server | TensorSharp.Server | TensorSharp.Server | ASP.NET Core server, OpenAI/Ollama adapters, inference engine host, web UI |
| TensorSharp.Cli | TensorSharp.Cli | TensorSharp.Cli | Console host and debugging / batch tooling |
This split keeps engine users off the web stack, keeps API-layer changes from leaking into core/runtime packages, and makes future benchmark or eval-harness projects easier to publish independently.
Validate package metadata and README dependency boundaries before publishing: `pwsh ./eng/verify-packages.ps1`
The verifier runs dotnet pack for the public packages above and fails if an internal dependency such as AdvUtils leaks into the .nuspec, or if a TensorSharp package depends on a layer outside this table.
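
The kind of check the verifier performs can be illustrated with a short shell sketch. This is not the logic of `eng/verify-packages.ps1`; `nuspec_has_dependency` and `assert_no_leak` are hypothetical helpers that only look for a `<dependency>` element with a given `id` in an already-extracted `.nuspec` file.

```bash
# Hypothetical check, not the logic of eng/verify-packages.ps1.
# Succeeds when the .nuspec file declares a dependency with the given id.
nuspec_has_dependency() {
  nuspec="$1"
  dep_id="$2"
  grep -q "<dependency[^>]*id=\"$dep_id\"" "$nuspec"
}

# Returns non-zero (and reports on stderr) when the dependency is present.
assert_no_leak() {
  if nuspec_has_dependency "$1" "$2"; then
    echo "ERROR: $2 leaked into $1" >&2
    return 1
  fi
}
```

Because a `.nupkg` is a zip archive, the `.nuspec` can be extracted first (for example with `unzip -p <package>.nupkg '*.nuspec'`) and then checked with `assert_no_leak <file>.nuspec AdvUtils`.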
`./TensorSharp.Cli --model <model.gguf> --backend ggml_metal --bench-kvcache --bench-kv-turns 4 --max-tokens 64`
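A high-performance C# inference engine for running LLMs locally.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute a legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment before deploying it to production, and carrying out any necessary security assessment.
✅ BSD 3-Clause: a permissive license that allows commercial use, modification, and distribution, but prohibits using the original authors' names for endorsement or promotion.
AI Skill Hub review: TensorSharp's core functionality is complete and its quality is excellent. For AI enthusiasts, it is a worthwhile addition to a personal toolkit. We suggest trying it in a non-production environment first and then rolling it out gradually.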
| Field | Value |
|---|---|
| Original name | TensorSharp |
| Original description | Open-source AI tool: A C# inference engine for running large language models (LLMs) locally using GGUF. ⭐100 · C# |
| Topics | LLM, C#, GGUF, inference |
| GitHub | https://github.com/zhongkaifu/TensorSharp |
| License | BSD-3-Clause |
| Language | C# |
Listed: 2026-06-17 · Updated: 2026-06-17 · License: BSD-3-Clause · AI Skill Hub does not legally endorse the accuracy of third-party content.