OpenAI-Compatible API

Shroud provides an OpenAI-compatible chat completions endpoint, allowing you to use existing OpenAI client libraries and tools with Shroud's confidential inference.

Endpoints

Chat Completions

    POST /v1/chat/completions
    Authorization: Bearer shroud_prod_...
    Content-Type: application/json

List Models

    GET /v1/models
    Authorization: Bearer shroud_prod_...

Request Format

The request format follows the OpenAI Chat Completions API:

    {
      "model": "Qwen/Qwen3-32B",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain zero-knowledge proofs"}
      ],
      "max_tokens": 512,
      "stream": true
    }

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID (e.g., Qwen/Qwen3-32B) |
| messages | array | Yes | Array of chat messages |
| max_tokens | integer | No | Maximum tokens to generate |
| stream | boolean | No | Enable streaming responses (default: false) |
| stream_options.include_usage | boolean | No | When stream is true, ask for a final usage chunk before [DONE]. See SSE streaming protocol. |
| chat_template_kwargs | object | No | Pass-through hook for the model's chat template. Used to opt in to chain-of-thought ({"enable_thinking": true}). See Reasoning content. |
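The parameters above can be assembled into a request body with a small helper. This is a minimal sketch using only the standard library; the function name build_chat_request and its keyword arguments are illustrative and not part of any Shroud SDK.

```python
from typing import Any, Optional


def build_chat_request(
    model: str,
    messages: list[dict[str, Any]],
    max_tokens: Optional[int] = None,
    stream: bool = False,
    include_usage: bool = False,
    enable_thinking: Optional[bool] = None,
) -> dict[str, Any]:
    """Assemble a /v1/chat/completions body from the documented parameters.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # model and messages are the only required fields.
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required")

    body: dict[str, Any] = {"model": model, "messages": messages}
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if stream:
        body["stream"] = True
        if include_usage:
            # stream_options.include_usage only applies when stream is true.
            body["stream_options"] = {"include_usage": True}
    if enable_thinking is not None:
        # Top-level pass-through field for the model's chat template.
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return body
```

Serialize the returned dictionary with json.dumps and send it as the POST body. Optional fields that were not requested are left out entirely rather than sent as null.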

Message Format

| Field | Type | Description |
| --- | --- | --- |
| role | string | system, user, or assistant |
| content | string or array | Message content. See note on multimodal arrays below. |

Response Format

Non-Streaming

    {
      "id": "chatcmpl-abc123",
      "object": "chat.completion",
      "model": "Qwen/Qwen3-32B",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Zero-knowledge proofs are..."
          },
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 128,
        "total_tokens": 143
      }
    }
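To show how the fields of a non-streaming response are typically read, here is a short sketch that extracts the answer text, the finish reason, and the token total from a raw JSON body. The helper name summarize_completion is hypothetical.

```python
import json
from typing import Any


def summarize_completion(raw: str) -> dict[str, Any]:
    """Pull the answer text, finish reason, and token total from a response body.

    Hypothetical helper; it reads only fields shown in the sample response.
    """
    data = json.loads(raw)
    choice = data["choices"][0]
    # Tolerate a missing or null usage object instead of raising.
    usage = data.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }
```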

Streaming (SSE)

When stream: true, the response uses Server-Sent Events. The wire format and parsing rules — including the optional usage chunk emitted when stream_options.include_usage: true is set — are documented in detail in SSE streaming protocol.

    data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}
    data: {"choices":[{"delta":{"content":"Zero"},"index":0}]}
    data: {"choices":[{"delta":{"content":"-knowledge"},"index":0}]}
    data: {"choices":[{"delta":{"content":" proofs"},"index":0}]}
    data: {"choices":[{"delta":{"reasoning_content":"Let me think..."},"index":0}]}
    data: [DONE]

The delta object may contain:

  • content: generated text. For reasoning-capable models such as Qwen/Qwen3-32B this is clean answer text by default. If you opt in to chain-of-thought via chat_template_kwargs.enable_thinking=true, the raw <think>...</think> tags appear inline in this field (see Reasoning content and <think> tags below).

  • reasoning_content: chain-of-thought reasoning, for models whose worker emits it as a separate field. Qwen/Qwen3-32B currently does not; its reasoning arrives inline in content, as described in Reasoning content and <think> tags below.
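The delta fields above can be accumulated from a raw SSE stream roughly as follows. This is a sketch, not the reference parser: it assumes each event arrives as a single data: line, that the optional usage chunk carries a top-level usage object as in the OpenAI API, and it does not handle mid-stream error events. The function name collect_stream is hypothetical.

```python
import json
from typing import Any, Iterable, Optional


def collect_stream(lines: Iterable[str]) -> dict[str, Any]:
    """Accumulate content, reasoning_content, and usage from SSE data lines."""
    content_parts: list[str] = []
    reasoning_parts: list[str] = []
    usage: Optional[dict] = None

    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # Assumed shape: the usage chunk has a top-level "usage" object.
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content_parts.append(delta["content"])
            if delta.get("reasoning_content"):
                reasoning_parts.append(delta["reasoning_content"])

    return {
        "content": "".join(content_parts),
        "reasoning": "".join(reasoning_parts),
        "usage": usage,
    }
```

Feeding the six sample events shown earlier through this function yields the content "Zero-knowledge proofs" and the reasoning "Let me think...". Consult SSE streaming protocol for the authoritative parsing rules before relying on this in production.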

Reasoning content and <think> tags

Qwen/Qwen3-32B is a reasoning-capable model. By default, Shroud disables chain-of-thought generation for this model so /v1/chat/completions returns clean answer text in content with no <think>...</think> tags. This is enforced inside the TEE proxy on every request — both the OpenAI REST path and the encrypted SDK WebSocket path — so callers see consistent behavior regardless of transport.

Opting in to chain-of-thought

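Re-enable thinking by sending chat_template_kwargs.enable_thinking=true with the request. With the stock openai-python SDK, pass it through the extra_body argument. The stock Node.js SDK (the openai npm package) has no extra_body option; there, add chat_template_kwargs directly to the request parameters. In Python: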

    from openai import OpenAI

    client = OpenAI(api_key="shroud_prod_...", base_url="https://shroud.us/v1")

    resp = client.chat.completions.create(
        model="Qwen/Qwen3-32B",
        messages=[{"role": "user", "content": "What is 2+2?"}],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(resp.choices[0].message.content)
    # "<think>\nOkay, so the user is asking ...\n</think>\n\n2 + 2 equals **4**."

Direct HTTP callers add the field at the top level of the request body alongside messages:

    curl -X POST https://shroud.us/v1/chat/completions \
      -H "Authorization: Bearer shroud_prod_..." \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-32B",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "chat_template_kwargs": {"enable_thinking": true}
      }'

When opt-in is active, the content field still carries the raw <think>...</think> tags inline with the answer; the reasoning is not delivered in a separate reasoning_content field. This is a known limitation: separating the two requires enabling --reasoning-parser qwen3 on the worker, which is planned as a follow-up. Strip the tags client-side if you want only the answer:

const answer = content.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
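For Python callers, the same stripping step can be sketched as below. This variant also returns the reasoning text instead of discarding it. The helper name split_thinking is hypothetical, and it assumes well-formed, non-nested <think>...</think> pairs.

```python
import re

# Non-greedy match so several <think> blocks are handled independently.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) from content that may contain <think> tags."""
    reasoning = "\n".join(block.strip() for block in _THINK_RE.findall(content))
    answer = _THINK_RE.sub("", content).strip()
    return reasoning, answer
```

When streaming, apply this to the fully assembled content rather than to individual deltas, since a tag can be split across chunks.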

Operator override

Operators can disable the platform-wide default-injection by setting the COCOON_DEFAULT_DISABLE_THINKING_MODELS env var on the TEE proxy — see Default-disable-thinking allowlist for details, including how to broaden the allowlist to other Qwen3 variants.

Available Models

The list of models served at any given time depends on which weights the inference fleet is running. Query GET /v1/models for the live list — only models reported by the inference upstream and active in the catalog are returned.

| Model | Parameters | Context | Notes |
| --- | --- | --- | --- |
| Qwen/Qwen3-32B | 32B | 131,072 tokens | General-purpose reasoning, code generation, multi-step tool orchestration. By default returns clean content without <think> tags; opt in to chain-of-thought via extra_body.chat_template_kwargs.enable_thinking=true. |

List Models

    curl https://shroud.us/v1/models \
      -H "Authorization: Bearer shroud_prod_..."

The unprefixed /v1/models returns a union across every enabled Cocoon network, so the same model id may appear once per network with a distinct owned_by value:

    {
      "object": "list",
      "data": [
        { "id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "cocoon-classic" },
        { "id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "cocoon-alpha" }
      ]
    }

To list models for a single network, use the network-prefixed form (/cocoon-classic/v1/models or /cocoon-alpha/v1/models). The same prefix works for /v1/chat/completions to pin chat traffic to a specific network. See Cocoon networks for the full route grid and SDK pinning patterns.

The created field that OpenAI's spec includes is intentionally omitted: Cocoon doesn't have a meaningful creation timestamp for a registered model, so synthesizing one would mislead.
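Because the unprefixed listing repeats a model id once per network, clients may want to group entries by owned_by and then build a network-pinned URL. The sketch below does both. The helper names are hypothetical, and it assumes the owned_by value is the same string used as the route prefix, as the examples in this section suggest.

```python
from collections import defaultdict


def models_by_network(listing: dict) -> dict[str, list[str]]:
    """Group model ids from a /v1/models response by their owned_by network."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in listing.get("data", []):
        # No "created" field is read, since the API intentionally omits it.
        grouped[entry["owned_by"]].append(entry["id"])
    return dict(grouped)


def network_url(base_url: str, network: str, path: str) -> str:
    """Build a network-prefixed URL such as <base>/cocoon-alpha/v1/models.

    Assumption: the network name doubles as the route prefix.
    """
    return f"{base_url.rstrip('/')}/{network}/{path.lstrip('/')}"
```

For example, network_url("https://shroud.us", "cocoon-alpha", "/v1/chat/completions") produces the pinned chat endpoint for that network. See Cocoon networks for the authoritative route grid.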

Examples

Using curl

    curl -X POST https://shroud.us/v1/chat/completions \
      -H "Authorization: Bearer shroud_prod_..." \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-32B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256,
        "stream": true
      }'

Using Python (OpenAI SDK)

    from openai import OpenAI

    client = OpenAI(
        api_key="shroud_prod_...",
        base_url="https://shroud.us/v1"
    )

    response = client.chat.completions.create(
        model="Qwen/Qwen3-32B",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
        stream=True
    )

    for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

Using TypeScript (OpenAI SDK)

    import OpenAI from 'openai';

    const client = new OpenAI({
      apiKey: 'shroud_prod_...',
      baseURL: 'https://shroud.us/v1',
    });

    const stream = await client.chat.completions.create({
      model: 'Qwen/Qwen3-32B',
      messages: [{ role: 'user', content: 'Hello!' }],
      max_tokens: 256,
      stream: true,
    });

    for await (const chunk of stream) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
    }

Errors

Errors on /v1/chat/completions and /v1/models use the OpenAI error envelope, not the SHROUD canonical envelope:

    {
      "error": {
        "message": "<human readable>",
        "type": "<server_error | invalid_request_error | ...>",
        "code": "<short code>"
      }
    }

The most common statuses are summarised below. Auth-layer errors (401, 429 from the rate limiter, 429 from the CU limiter) are emitted by middleware that sits in front of this handler and use the SHROUD canonical envelope — see Error reference.

| Status | When | Notes |
| --- | --- | --- |
| 400 | Missing or invalid model/messages, request body over 8 MB | OpenAI envelope. code is model_required, messages_required, model_not_found, or request_too_large. |
| 502 | Upstream worker connect or stream error | OpenAI envelope. type: server_error, code: upstream_error. |
| 503 | /v1/models upstream-unavailable fallback | OpenAI envelope. The response sets Retry-After: 5. Wait at least 5 seconds before retrying. The chat-completions handler does not currently emit this status; it is specific to model-listing failures. |
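A client can normalize these responses into one structure before deciding whether to retry. The sketch below parses the OpenAI error envelope and reads Retry-After when it is given in seconds. The ApiError class and parse_error function are hypothetical names. Since the SHROUD canonical envelope used by auth-layer errors is not described here, any body that does not match the OpenAI shape is kept as a raw message.

```python
import json
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ApiError:
    status: int
    message: str
    type: Optional[str]
    code: Optional[str]
    retry_after: Optional[float]  # seconds, when the server supplied it


def parse_error(status: int, headers: Mapping[str, str], body: str) -> ApiError:
    """Normalize an error response; hypothetical helper, not a Shroud API."""
    try:
        parsed = json.loads(body)
        err = parsed.get("error") if isinstance(parsed, dict) else None
    except ValueError:
        err = None
    if not isinstance(err, dict):
        err = {}  # not the OpenAI envelope; fall back to the raw body below

    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        retry_after: Optional[float] = float(lowered["retry-after"])
    except (KeyError, ValueError):
        retry_after = None  # absent, or an HTTP-date this sketch does not parse

    return ApiError(
        status=status,
        message=err.get("message") or body,
        type=err.get("type"),
        code=err.get("code"),
        retry_after=retry_after,
    )
```

For a 503 from /v1/models, sleep for at least retry_after seconds before trying again. Whether to retry a 502 is a client policy decision that this sketch leaves open.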

Mid-stream errors after data: events have started follow the OpenAI mid-stream-error shape — see SSE streaming protocol — Mid-stream error events.

Differences from Cocoon SDK

| Feature | OpenAI-Compatible API | Cocoon SDK |
| --- | --- | --- |
| Encryption | Transport-level (TLS) | End-to-end (AES-256-GCM) |
| TEE attestation | Not available | Full TDX quote verification |
| Selective disclosure | Not available | Configurable |
| Protocol | HTTP/SSE | WebSocket |
| Client libraries | Any OpenAI-compatible client | Shroud SDK (Go/TypeScript) |

Last modified: 08 May 2026