
Endpoint Reference

This page documents all available API endpoints with their parameters, request formats, and response structures.

Chat Completions

POST /v1/chat/completions

The primary endpoint. Send a conversation (list of messages) and receive an AI response.

Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| model | string | Yes | – | Model ID (see /v1/models for options) |
| messages | array | Yes | – | Array of message objects with role and content |
| max_tokens | integer | No | 150 | Maximum tokens in the response |
| temperature | number | No | 1.0 | Sampling temperature (0.0–2.0). Lower = more deterministic |
| top_p | number | No | 1.0 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable server-sent events streaming |
| stop | string or array | No | null | Stop sequence(s) |
| presence_penalty | number | No | 0 | Penalise tokens already in context (-2.0 to 2.0) |
| frequency_penalty | number | No | 0 | Penalise frequent tokens (-2.0 to 2.0) |
| tools | array | No | null | List of tools/functions the model may call (OpenAI format) |
| tool_choice | string or object | No | auto | Controls tool calling: none, auto, required, or {"type": "function", "function": {"name": "..."}} |
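For example, a minimal non-streaming request could look like the following Python sketch. The base URL, model ID, and API_KEY environment variable are placeholders, not values defined by this API:

```python
import json
import os
import urllib.request

# Placeholder values -- substitute your actual base URL, model ID, and key.
BASE_URL = "https://api.example.com"
API_KEY = os.environ.get("API_KEY", "")

payload = {
    "model": "model-id",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "max_tokens": 150,
    "temperature": 0.7,
}

def chat_completion(payload):
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# reply = chat_completion(payload)  # uncomment to send the request
# print(reply["choices"][0]["message"]["content"])
```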

Extended Parameters

These parameters are supported by our vLLM-based inference backends and follow the OpenAI API specification. All are optional — if omitted, provider defaults apply.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| reasoning_effort | string | provider default | Controls reasoning depth for thinking models: low, medium, or high. Lower values produce faster responses with less internal reasoning |
| max_completion_tokens | integer | provider default | Upper bound for generated tokens, including reasoning tokens. Use this instead of max_tokens when working with reasoning models |
| response_format | object | null | Request structured output: {"type": "json_object"} for JSON mode, or {"type": "json_schema", "json_schema": {...}} for schema-constrained output |
| seed | integer | null | For deterministic sampling (best-effort, not guaranteed). Repeated requests with the same seed and parameters should return similar results |
| logprobs | boolean | false | Return log probabilities of output tokens |
| top_logprobs | integer | null | Number of most likely tokens to return per position (0–20). Requires logprobs: true |
| parallel_tool_calls | boolean | true | Enable parallel function calling during tool use |
When to use reasoning parameters

For reasoning models (e.g., models with "Reasoning" in their name), use reasoning_effort to control the balance between speed and accuracy. Set reasoning_effort: "low" for fast tool-calling workflows, or "high" for complex multi-step reasoning tasks. Use max_completion_tokens instead of max_tokens to ensure the token budget covers both thinking and output.
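A sketch of the two configurations described above; the model ID and token budgets are illustrative placeholders, not recommended values:

```python
# Fast tool-calling configuration: shallow reasoning, tight token budget.
fast_request = {
    "model": "reasoning-model-id",   # hypothetical reasoning model ID
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "reasoning_effort": "low",
    "max_completion_tokens": 512,    # budget covers thinking + output tokens
}

# Deep multi-step reasoning: more internal thinking allowed.
deep_request = {
    "model": "reasoning-model-id",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "reasoning_effort": "high",
    "max_completion_tokens": 8192,
}
```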

Message Object

{
  "role": "system" | "user" | "assistant",
  "content": "Your message text"
}
  • system — sets the AI's behaviour and instructions
  • user — the human's input
  • assistant — previous AI responses (for multi-turn context)
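For multi-turn context, each prior assistant reply is appended before the next user turn. A small illustrative helper (not part of the API) makes the pattern explicit:

```python
def make_conversation(system_prompt, turns):
    """Build a messages array from a system prompt and a list of
    (user_text, assistant_text) turns. Pass None as assistant_text
    for the final turn that is still awaiting a reply."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        if assistant_text is not None:
            messages.append({"role": "assistant", "content": assistant_text})
    return messages

messages = make_conversation(
    "You are a helpful assistant.",
    [("What is the capital of France?", "Paris."),
     ("And its population?", None)],
)
# messages now holds four entries: system, user, assistant, user
```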

Non-Streaming Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! I can help you with..."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 45,
    "total_tokens": 57
  }
}

Streaming Response

Set stream: true to receive server-sent events.

  • Content-Type: text/event-stream
  • Each chunk is a line prefixed with data: followed by a JSON object
  • The stream ends with data: [DONE]

Example chunk:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}

Concatenate all delta.content values to build the full response.
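That accumulation step can be sketched as a small parser. Field names follow the example chunk above; error handling is omitted for brevity:

```python
import json

def extract_delta(line):
    """Return the delta content from one SSE line, or None for
    non-data lines, the [DONE] sentinel, and chunks without content."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):].strip()
    if data == "[DONE]":
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

# Simulated stream using the chunk format shown above.
stream = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",'
    '"choices":[{"delta":{"content":"Hello"},"index":0}]}',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",'
    '"choices":[{"delta":{"content":" world"},"index":0}]}',
    "data: [DONE]",
]
text = "".join(c for c in (extract_delta(l) for l in stream) if c is not None)
# text == "Hello world"
```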


Text Completions

POST /v1/completions

Generate text from a prompt string (non-chat format). Requests are converted to chat format internally.

Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| model | string | Yes | – | Model ID |
| prompt | string | Yes | – | The text prompt |
| max_tokens | integer | No | 16 | Maximum tokens |
| temperature | number | No | 1.0 | Sampling temperature |
| top_p | number | No | 1.0 | Nucleus sampling |
| stream | boolean | No | false | Enable streaming |
| stop | string or array | No | null | Stop sequence(s) |
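A minimal prompt-based call might look like this sketch; the base URL and model ID are placeholders:

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder, not a real endpoint

def build_completion_payload(prompt, model="model-id", **params):
    """Assemble a request body for POST /v1/completions."""
    return {"model": model, "prompt": prompt, "max_tokens": 16, **params}

def complete(prompt, api_key="", **params):
    """Send the prompt and return the generated text of the first choice."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(build_completion_payload(prompt, **params)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

payload = build_completion_payload("Once upon a time", temperature=0.8)
# payload carries max_tokens = 16, the endpoint default
```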

Response

{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "choices": [
    {
      "text": "Generated text here...",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 20,
    "total_tokens": 25
  }
}
Tip: For most use cases, we recommend using /v1/chat/completions instead. The chat format gives you more control via system messages and multi-turn conversations.


List Models

GET /v1/models

Returns all available models.

Response

{
  "object": "list",
  "data": [
    {
      "id": "model-id",
      "object": "model",
      "owned_by": "schatzi"
    }
  ]
}
Info: Model availability may change. Check this endpoint programmatically rather than hardcoding model IDs. See Model Comparison for capabilities and pricing.
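Given the parsed response body, a programmatic availability check might look like this:

```python
def available_model_ids(models_response):
    """Extract the set of model IDs from a GET /v1/models response body."""
    return {entry["id"] for entry in models_response["data"]}

# Example using the response shape shown above.
sample = {
    "object": "list",
    "data": [{"id": "model-id", "object": "model", "owned_by": "schatzi"}],
}
ids = available_model_ids(sample)
# ids == {"model-id"}; check membership before issuing requests
```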


Retrieve Model

GET /v1/models/{model}

Returns details for a specific model by its ID. The response format matches a single entry from the List Models endpoint.


Embeddings

POST /v1/embeddings

Generate vector embeddings for text input.

Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| model | string | Yes | Model ID (must support embeddings) |
| input | string or array | Yes | Text to embed (single string or array of strings) |
Info: Not all models support embeddings. Use /v1/models to check which models are available for this endpoint.
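Embedding vectors are commonly compared with cosine similarity. The payload builder below follows the table above; the assumption that vectors come back at data[i].embedding is the usual OpenAI-style layout and should be verified against actual responses:

```python
import math

def embeddings_payload(texts, model="embedding-model-id"):
    """Request body for POST /v1/embeddings; accepts one string or a list.
    The model ID is a hypothetical placeholder."""
    return {"model": model, "input": texts}

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# In OpenAI-style responses the vectors are usually found at
# response["data"][i]["embedding"]. These short vectors are stand-ins.
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
# sim == 1.0 (identical directions)
```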


Error Responses

All errors follow a consistent format:

{
  "error": {
    "message": "Human-readable description",
    "type": "error_type",
    "code": "error_code"
  }
}

Error Codes Reference

| HTTP Status | Type | When |
|-------------|------|------|
| 400 | invalid_request_error | Missing or invalid parameters |
| 402 | subscription_required | No active subscription |
| 403 | Forbidden | Invalid or revoked API key |
| 429 | rate_limit_exceeded | Usage limit reached for billing period |
| 500 | api_error | Internal server error |
| 503 | model_unavailable | Model temporarily unavailable |
  • 429 responses: your monthly CHF budget is exhausted. Check your usage in the dashboard or via the usage API. Usage resets at the start of each billing period.
  • 503 responses: the model is temporarily experiencing issues. Retry after a short delay or try a different model.
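A retry policy matching this guidance might look like the sketch below. Since a 429 here signals an exhausted monthly budget, retrying within the same billing period will not help, so only 503 is treated as retryable; the backoff values are illustrative:

```python
import time

RETRYABLE = {503}  # 429 means the budget is exhausted; retrying is pointless

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))

def call_with_retries(send, max_attempts=4, sleep=time.sleep):
    """Call `send()` (which returns (status, body)), retrying on 503.
    Other statuses, including 429, are returned immediately."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        sleep(backoff_delay(attempt))
```

Injecting `sleep` keeps the policy testable without real delays; in production the default `time.sleep` is used.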