Endpoint Reference
This page documents all available API endpoints with their parameters, request formats, and response structures.
Chat Completions
POST /v1/chat/completions
The primary endpoint. Send a conversation (list of messages) and receive an AI response.
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | — | Model ID (see /v1/models for options) |
| messages | array | Yes | — | Array of message objects with role and content |
| max_tokens | integer | No | 150 | Maximum tokens in the response |
| temperature | number | No | 1.0 | Sampling temperature (0.0–2.0). Lower = more deterministic |
| top_p | number | No | 1.0 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable server-sent events streaming |
| stop | string or array | No | null | Stop sequence(s) |
| presence_penalty | number | No | 0 | Penalise tokens already in context (-2.0 to 2.0) |
| frequency_penalty | number | No | 0 | Penalise frequent tokens (-2.0 to 2.0) |
| tools | array | No | null | List of tools/functions the model may call (OpenAI format) |
| tool_choice | string or object | No | auto | Controls tool calling: none, auto, required, or {"type": "function", "function": {"name": "..."}} |
Extended Parameters
These parameters are supported by our vLLM-based inference backends and follow the OpenAI API specification. All are optional — if omitted, provider defaults apply.
| Parameter | Type | Default | Description |
|---|---|---|---|
| reasoning_effort | string | provider default | Controls reasoning depth for thinking models: low, medium, or high. Lower values produce faster responses with less internal reasoning |
| max_completion_tokens | integer | provider default | Upper bound for generated tokens including reasoning tokens. Use this instead of max_tokens when working with reasoning models |
| response_format | object | null | Request structured output: {"type": "json_object"} for JSON mode, or {"type": "json_schema", "json_schema": {...}} for schema-constrained output |
| seed | integer | null | For deterministic sampling (best-effort, not guaranteed). Repeated requests with the same seed and parameters should return similar results |
| logprobs | boolean | false | Return log probabilities of output tokens |
| top_logprobs | integer | null | Number of most likely tokens to return per position (0–20). Requires logprobs: true |
| parallel_tool_calls | boolean | true | Enable parallel function calling during tool use |
For reasoning models (e.g., models with "Reasoning" in their name), use reasoning_effort to control the balance between speed and accuracy. Set reasoning_effort: "low" for fast tool-calling workflows, or "high" for complex multi-step reasoning tasks. Use max_completion_tokens instead of max_tokens to ensure the token budget covers both thinking and output.
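To make the parameters above concrete, here is a minimal sketch of a request body for a reasoning model. The model ID is a placeholder, not a real model name — query GET /v1/models for the IDs available to you.

```python
import json

# Sketch of a /v1/chat/completions request body for a reasoning model.
# "example-reasoning-model" is a placeholder; use a real ID from GET /v1/models.
payload = {
    "model": "example-reasoning-model",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise the plan in one sentence."},
    ],
    "reasoning_effort": "low",       # favour speed in tool-calling workflows
    "max_completion_tokens": 512,    # budget covers reasoning AND visible output
}
body = json.dumps(payload)
```

Send `body` as the JSON request body with a `Content-Type: application/json` header.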
Message Object
{
"role": "system" | "user" | "assistant",
"content": "Your message text"
}
- system — sets the AI's behaviour and instructions
- user — the human's input
- assistant — previous AI responses (for multi-turn context)
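Multi-turn context is built by appending each completed exchange to the messages array before the next request. A small sketch (the helper function is ours, not part of the API):

```python
def add_turn(messages, user_text, assistant_text):
    """Append one completed user/assistant exchange to a message list."""
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
    return messages

# Seed with a system message, replay prior turns, then add the new question.
history = [{"role": "system", "content": "You are terse."}]
add_turn(history, "Hi", "Hello.")
history.append({"role": "user", "content": "What did I just say?"})
```

The full `history` list is what you send as `messages`; the API is stateless, so every request must carry the whole conversation.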
Non-Streaming Response
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "Hello! I can help you with..."
},
"finish_reason": "stop",
"index": 0
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 45,
"total_tokens": 57
}
}
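Extracting the reply from the structure above is a matter of indexing into the first choice. A sketch using the sample body shown here:

```python
import json

# Sample non-streaming response body, as documented above.
raw = (
    '{"id": "chatcmpl-abc123", "object": "chat.completion",'
    ' "choices": [{"message": {"role": "assistant",'
    ' "content": "Hello! I can help you with..."},'
    ' "finish_reason": "stop", "index": 0}],'
    ' "usage": {"prompt_tokens": 12, "completion_tokens": 45, "total_tokens": 57}}'
)
resp = json.loads(raw)
reply = resp["choices"][0]["message"]["content"]   # the assistant's text
finish = resp["choices"][0]["finish_reason"]       # "stop", or e.g. a length cut-off
total = resp["usage"]["total_tokens"]              # for cost tracking
```

Checking `finish_reason` is worthwhile: a value other than "stop" usually means the reply was truncated by the token limit.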
Streaming Response
Set stream: true to receive server-sent events.
- Content-Type: text/event-stream
- Each chunk is a line prefixed with data: followed by a JSON object
- The stream ends with data: [DONE]
Example chunk:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}
Concatenate all delta.content values to build the full response.
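A sketch of that accumulation step, fed with sample SSE lines in the chunk format shown above:

```python
import json

def accumulate_sse(lines):
    """Concatenate delta.content values from a sequence of SSE data lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # ignore blank lines / comments
        data = line[len("data: "):]
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

sample = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",'
    '"choices":[{"delta":{"content":"Hello"},"index":0}]}',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",'
    '"choices":[{"delta":{"content":" there"},"index":0}]}',
    'data: [DONE]',
]
```

In a real client you would iterate over the response body line by line instead of a list, but the parsing logic is the same.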
Text Completions
POST /v1/completions
Generate text from a prompt string (non-chat format). Internally converted to chat format.
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | — | Model ID |
| prompt | string | Yes | — | The text prompt |
| max_tokens | integer | No | 16 | Maximum tokens |
| temperature | number | No | 1.0 | Sampling temperature |
| top_p | number | No | 1.0 | Nucleus sampling |
| stream | boolean | No | false | Enable streaming |
| stop | string or array | No | null | Stop sequence(s) |
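A minimal request body for this endpoint, with a placeholder model ID:

```python
import json

# Sketch of a /v1/completions request body; "example-model" is a placeholder.
payload = {
    "model": "example-model",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.7,
    "stop": ["\n\n"],   # stop generating at the first blank line
}
body = json.dumps(payload)
```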
Response
{
"id": "cmpl-abc123",
"object": "text_completion",
"choices": [
{
"text": "Generated text here...",
"index": 0,
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 5,
"completion_tokens": 20,
"total_tokens": 25
}
}
For most use cases, we recommend using /v1/chat/completions instead. The chat format gives you more control via system messages and multi-turn conversations.
List Models
GET /v1/models
Returns all available models.
Response
{
"object": "list",
"data": [
{
"id": "model-id",
"object": "model",
"owned_by": "schatzi"
}
]
}
Model availability may change. Check this endpoint programmatically rather than hardcoding model IDs. See Model Comparison for capabilities and pricing.
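Checking the endpoint programmatically amounts to listing the IDs from the response and verifying your chosen model is among them. A sketch against the response shape above:

```python
def model_ids(models_response):
    """Return the list of model IDs from a parsed GET /v1/models response."""
    return [m["id"] for m in models_response["data"]]

# Sample response body, as documented above.
sample = {
    "object": "list",
    "data": [{"id": "model-id", "object": "model", "owned_by": "schatzi"}],
}
available = model_ids(sample)
```

In practice you would parse the HTTP response body with `json.loads` first, then check `"your-model" in available` before sending requests.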
Retrieve Model
GET /v1/models/{model}
Returns details for a specific model by its ID. The response format matches a single entry from the List Models endpoint.
Embeddings
POST /v1/embeddings
Generate vector embeddings for text input.
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID (must support embeddings) |
| input | string or array | Yes | Text to embed (single string or array of strings) |
Not all models support embeddings. Use /v1/models to check which models are available for this endpoint.
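Both input forms look like this in a request body; the model ID is a placeholder:

```python
import json

# Single-string input: one embedding comes back.
single = {
    "model": "example-embedding-model",
    "input": "A single sentence to embed.",
}

# Array input: one embedding per string, in the same order.
batch = {
    "model": "example-embedding-model",
    "input": ["first text", "second text", "third text"],
}
body = json.dumps(batch)
```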
Error Responses
All errors follow a consistent format:
{
"error": {
"message": "Human-readable description",
"type": "error_type",
"code": "error_code"
}
}
Error Codes Reference
| HTTP Status | Type | When |
|---|---|---|
| 400 | invalid_request_error | Missing or invalid parameters |
| 402 | subscription_required | No active subscription |
| 403 | Forbidden | Invalid or revoked API key |
| 429 | rate_limit_exceeded | Usage limit reached for billing period |
| 500 | api_error | Internal server error |
| 503 | model_unavailable | Model temporarily unavailable |
- 429 responses: your monthly CHF budget is exhausted. Check your usage in the dashboard or via the usage API. Usage resets at the start of each billing period.
- 503 responses: the model is temporarily experiencing issues. Retry after a short delay or try a different model.
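A client-side sketch of that advice: parse the error body, and retry 503s with exponential backoff. The backoff policy shown is an illustrative choice on the client's part, not something the API mandates.

```python
import json
import random

RETRYABLE_STATUSES = {503}   # retry transient outages; 429 means wait for the reset

def parse_error(body):
    """Return (type, message) from an error response body."""
    err = json.loads(body)["error"]
    return err["type"], err["message"]

def backoff_delay(attempt, base=0.5, cap=8.0):
    """Exponential backoff with full jitter: uniform over [0, min(cap, base*2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

sample = (
    '{"error": {"message": "Model temporarily unavailable",'
    ' "type": "model_unavailable", "code": "model_unavailable"}}'
)
err_type, err_msg = parse_error(sample)
```

For 429 responses there is no point retrying quickly: the budget resets only at the next billing period, so surface the error to the user instead.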