ZeroGPU

The compute efficient layer for AI inference, routing high-volume tasks to small and nano models to cut cost and latency.

Paid
·
Web, API
·
Other

Try ZeroGPU

What is ZeroGPU?

ZeroGPU is an AI inference platform that routes routine, high-volume tasks to specialised small and nano language models across an edge-powered network with cloud fallback. It offers an OpenAI-compatible chat and responses API so developers can offload work such as classification, extraction and moderation from expensive frontier models. The service is billed on usage, with per-token rates that the company claims reduce inference costs by more than half. It targets teams running large volumes of repetitive AI workloads at scale.

Key features

OpenAI-compatible API

Drop-in chat and responses endpoints that work with existing OpenAI client code.

Small and nano model routing

Routes routine tasks to specialised small language models instead of frontier models.

Edge-powered inference

Distributed inference across an edge network with cloud fallback for scale and lower latency.

Cost calculator

An interactive tool to estimate spend and compare ZeroGPU model rates against providers such as GPT-5.4.

Project-level API keys

Per-project keys with usage analytics for tracking consumption.

Task-specific workloads

Built for classification, extraction, summarisation, PII detection, fraud detection, content moderation and query routing.

Monetize Your App SDK

An SDK that lets app developers earn revenue by having their users' idle devices fulfil network inference jobs.

Pros & cons

Advantages

The OpenAI-compatible API means existing applications can switch with minimal code changes.
Per-token rates are very low for supported small models, with a claimed cost reduction of more than 50 percent on routine workloads.
It covers a wide range of practical use cases including moderation, fraud detection, translation and document processing.
The interactive calculator gives transparent per-million-token rates and side-by-side cost comparisons before committing.
Usage-based billing means there is no fixed subscription, so spend scales with actual volume.

Limitations

There is no dedicated pricing page, so full rate cards across all models are not published and must be gathered from the calculator.
The model catalogue is geared towards smaller models, so it is not a substitute for top-tier frontier reasoning models.
No free tier or sign-up credits are advertised on the public pages, so evaluation cost is unclear up front.
Documentation lives on a separate docs subdomain and some linked pages were not reachable during review.

Use cases

AdTech teams classifying user intent and mapping IAB categories in real time for ad targeting.

Trust and safety teams moderating millions of posts with toxicity detection, NSFW filtering and policy checks.

Compliance and security teams redacting PII and detecting prompt injection or jailbreak attempts at scale.

Fintech teams running fraud detection, transaction risk scoring and trading sentiment analysis.

Document-heavy businesses extracting structured data from invoices, contracts and forms via OCR.

AI agent developers using small models for tool selection, query routing and multi-step reasoning cheaply.

Ready to try ZeroGPU?

Try ZeroGPU

Pricing

Usage-based

From $0.02 in / $0.05 out per 1M tokens (llama-3.1-8b-instruct-fast)

Pay-as-you-go per-token billing, OpenAI-compatible API, project-level API keys and usage analytics. Rates vary by model; use the cost calculator to estimate spend.

Get Usage-based

Get started with ZeroGPU

Click through to ZeroGPU and start using it now.

Try ZeroGPU