Endpoint Reference
This page documents all available API endpoints with their parameters, request formats, and response structures.
Chat Completions
POST /v1/chat/completions
The primary endpoint. Send a conversation (list of messages) and receive an AI response.
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | — | Model ID (see /v1/models for options) |
| messages | array | Yes | — | Array of message objects with role and content |
| max_tokens | integer | No | 150 | Maximum tokens in the response |
| temperature | number | No | 1.0 | Sampling temperature (0.0–2.0). Lower = more deterministic |
| top_p | number | No | 1.0 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable server-sent events streaming |
| stop | string or array | No | null | Stop sequence(s) |
| presence_penalty | number | No | 0 | Penalise tokens already in context (-2.0 to 2.0) |
| frequency_penalty | number | No | 0 | Penalise frequent tokens (-2.0 to 2.0) |
| tools | array | No | null | List of tools/functions the model may call (OpenAI format) |
| tool_choice | string or object | No | auto | Controls tool calling: none, auto, required, or {"type": "function", "function": {"name": "..."}} |
Extended Parameters
These parameters are supported by our vLLM-based inference backends and follow the OpenAI API specification. All are optional — if omitted, provider defaults apply.
| Parameter | Type | Default | Description |
|---|---|---|---|
| reasoning_effort | string | provider default | Controls reasoning depth for thinking models: low, medium, or high. Lower values produce faster responses with less internal reasoning |
| max_completion_tokens | integer | provider default | Upper bound for generated tokens including reasoning tokens. Use this instead of max_tokens when working with reasoning models |
| response_format | object | null | Request structured output: {"type": "json_object"} for JSON mode, or {"type": "json_schema", "json_schema": {...}} for schema-constrained output |
| seed | integer | null | For deterministic sampling (best-effort, not guaranteed). Repeated requests with the same seed and parameters should return similar results |
| logprobs | boolean | false | Return log probabilities of output tokens |
| top_logprobs | integer | null | Number of most likely tokens to return per position (0–20). Requires logprobs: true |
| parallel_tool_calls | boolean | true | Enable parallel function calling during tool use |
For reasoning models (e.g., models with "Reasoning" in their name), use reasoning_effort to control the balance between speed and accuracy. Set reasoning_effort: "low" for fast tool-calling workflows, or "high" for complex multi-step reasoning tasks. Use max_completion_tokens instead of max_tokens to ensure the token budget covers both thinking and output.
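To make the parameters above concrete, here is a minimal sketch of a request body for a reasoning model. The model ID is a placeholder, not a real model name — query GET /v1/models for the IDs available to you.

```python
import json

# Sketch of a /v1/chat/completions request body for a reasoning model.
# "example-reasoning-model" is a placeholder; use a real ID from GET /v1/models.
payload = {
    "model": "example-reasoning-model",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise the plan in one sentence."},
    ],
    "reasoning_effort": "low",       # favour speed in tool-calling workflows
    "max_completion_tokens": 512,    # budget covers reasoning AND visible output
}
body = json.dumps(payload)
```

Send `body` as the JSON request body with a `Content-Type: application/json` header.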
Message Object
{
"role": "system" | "user" | "assistant",
"content": "Your message text"
}
- system — sets the AI's behaviour and instructions
- user — the human's input
- assistant — previous AI responses (for multi-turn context)
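Multi-turn context is built by appending each completed exchange to the messages array before the next request. A small sketch (the helper function is ours, not part of the API):

```python
def add_turn(messages, user_text, assistant_text):
    """Append one completed user/assistant exchange to a message list."""
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
    return messages

# Seed with a system message, replay prior turns, then add the new question.
history = [{"role": "system", "content": "You are terse."}]
add_turn(history, "Hi", "Hello.")
history.append({"role": "user", "content": "What did I just say?"})
```

The full `history` list is what you send as `messages`; the API is stateless, so every request must carry the whole conversation.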
Non-Streaming Response
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "Hello! I can help you with..."
},
"finish_reason": "stop",
"index": 0
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 45,
"total_tokens": 57
}
}
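Extracting the reply from the structure above is a matter of indexing into the first choice. A sketch using the sample body shown here:

```python
import json

# Sample non-streaming response body, as documented above.
raw = (
    '{"id": "chatcmpl-abc123", "object": "chat.completion",'
    ' "choices": [{"message": {"role": "assistant",'
    ' "content": "Hello! I can help you with..."},'
    ' "finish_reason": "stop", "index": 0}],'
    ' "usage": {"prompt_tokens": 12, "completion_tokens": 45, "total_tokens": 57}}'
)
resp = json.loads(raw)
reply = resp["choices"][0]["message"]["content"]   # the assistant's text
finish = resp["choices"][0]["finish_reason"]       # "stop", or e.g. a length cut-off
total = resp["usage"]["total_tokens"]              # for cost tracking
```

Checking `finish_reason` is worthwhile: a value other than "stop" usually means the reply was truncated by the token limit.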
Streaming Response
Set stream: true to receive server-sent events.
- Content-Type: text/event-stream
- Each chunk is a line prefixed with data: followed by a JSON object
- The stream ends with data: [DONE]
Example chunk:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}
Concatenate all delta.content values to build the full response.
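A sketch of that accumulation step, fed with sample SSE lines in the chunk format shown above:

```python
import json

def accumulate_sse(lines):
    """Concatenate delta.content values from a sequence of SSE data lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # ignore blank lines / comments
        data = line[len("data: "):]
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

sample = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",'
    '"choices":[{"delta":{"content":"Hello"},"index":0}]}',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk",'
    '"choices":[{"delta":{"content":" there"},"index":0}]}',
    'data: [DONE]',
]
```

In a real client you would iterate over the response body line by line instead of a list, but the parsing logic is the same.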
Text Completions
POST /v1/completions
Generate text from a prompt string (non-chat format). Internally converted to chat format.
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | — | Model ID |
| prompt | string | Yes | — | The text prompt |
| max_tokens | integer | No | 16 | Maximum tokens |
| temperature | number | No | 1.0 | Sampling temperature |
| top_p | number | No | 1.0 | Nucleus sampling |
| stream | boolean | No | false | Enable streaming |
| stop | string or array | No | null | Stop sequence(s) |
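A minimal request body for this endpoint, with a placeholder model ID:

```python
import json

# Sketch of a /v1/completions request body; "example-model" is a placeholder.
payload = {
    "model": "example-model",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.7,
    "stop": ["\n\n"],   # stop generating at the first blank line
}
body = json.dumps(payload)
```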
Response
{
"id": "cmpl-abc123",
"object": "text_completion",
"choices": [
{
"text": "Generated text here...",
"index": 0,
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 5,
"completion_tokens": 20,
"total_tokens": 25
}
}
For most use cases, we recommend using /v1/chat/completions instead. The chat format gives you more control via system messages and multi-turn conversations.
List Models
GET /v1/models
Returns all available models.
Response
{
"object": "list",
"data": [
{
"id": "model-id",
"object": "model",
"owned_by": "schatzi"
}
]
}
Model availability may change. Check this endpoint programmatically rather than hardcoding model IDs. See Model Comparison for capabilities and pricing.
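Checking the endpoint programmatically amounts to listing the IDs from the response and verifying your chosen model is among them. A sketch against the response shape above:

```python
def model_ids(models_response):
    """Return the list of model IDs from a parsed GET /v1/models response."""
    return [m["id"] for m in models_response["data"]]

# Sample response body, as documented above.
sample = {
    "object": "list",
    "data": [{"id": "model-id", "object": "model", "owned_by": "schatzi"}],
}
available = model_ids(sample)
```

In practice you would parse the HTTP response body with `json.loads` first, then check `"your-model" in available` before sending requests.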
Retrieve Model
GET /v1/models/{model}
Returns details for a specific model by its ID. The response format matches a single entry from the List Models endpoint.
Embeddings
POST /v1/embeddings
Generate vector embeddings for text input.
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID (must support embeddings) |
| input | string or array | Yes | Text to embed (single string or array of strings) |
Not all models support embeddings. Use /v1/models to check which models are available for this endpoint.
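Both input forms look like this in a request body; the model ID is a placeholder:

```python
import json

# Single-string input: one embedding comes back.
single = {
    "model": "example-embedding-model",
    "input": "A single sentence to embed.",
}

# Array input: one embedding per string, in the same order.
batch = {
    "model": "example-embedding-model",
    "input": ["first text", "second text", "third text"],
}
body = json.dumps(batch)
```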
Error Responses
All errors follow a consistent format:
{
"error": {
"message": "Human-readable description",
"type": "error_type",
"code": "error_code"
}
}
Error Codes Reference
| HTTP Status | Type | When |
|---|---|---|
| 400 | invalid_request_error | Missing or invalid parameters |
| 402 | subscription_required | No active subscription |
| 403 | Forbidden | Invalid or revoked API key |
| 429 | rate_limit_exceeded | Usage limit reached for billing period |
| 500 | api_error | Internal server error |
| 503 | model_unavailable | Model temporarily unavailable |
- 429 responses: your monthly CHF budget is exhausted. Check your usage in the dashboard or via the usage API. Usage resets at the start of each billing period.
- 503 responses: the model is temporarily experiencing issues. Retry after a short delay or try a different model.
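A client-side sketch of that advice: parse the error body, and retry 503s with exponential backoff. The backoff policy shown is an illustrative choice on the client's part, not something the API mandates.

```python
import json
import random

RETRYABLE_STATUSES = {503}   # retry transient outages; 429 means wait for the reset

def parse_error(body):
    """Return (type, message) from an error response body."""
    err = json.loads(body)["error"]
    return err["type"], err["message"]

def backoff_delay(attempt, base=0.5, cap=8.0):
    """Exponential backoff with full jitter: uniform over [0, min(cap, base*2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

sample = (
    '{"error": {"message": "Model temporarily unavailable",'
    ' "type": "model_unavailable", "code": "model_unavailable"}}'
)
err_type, err_msg = parse_error(sample)
```

For 429 responses there is no point retrying quickly: the budget resets only at the next billing period, so surface the error to the user instead.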