Shroud provides an OpenAI-compatible chat completions endpoint, allowing you to use existing OpenAI client libraries and tools with Shroud's confidential inference.
Endpoints
Chat Completions
POST /v1/chat/completions
Authorization: Bearer shroud_prod_...
Content-Type: application/json
List Models
GET /v1/models
Authorization: Bearer shroud_prod_...
Request Format
The request format follows the OpenAI Chat Completions API:
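As a minimal sketch of such a request body, assuming the standard OpenAI fields (`model`, `messages`, `stream`, `stream_options`) are accepted unchanged. The helper name `build_chat_request` is hypothetical, and optional sampling parameters are left out because this page does not say which ones are supported:

```python
import json

def build_chat_request(model, user_text, stream=False, include_usage=False):
    """Build an OpenAI-style chat completions body (hypothetical helper)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    if stream:
        body["stream"] = True
        if include_usage:
            # Requests the optional usage chunk on the stream.
            body["stream_options"] = {"include_usage": True}
    return body

body = build_chat_request("Qwen/Qwen3-32B", "What is 2+2?", stream=True, include_usage=True)
print(json.dumps(body, indent=2))
```

The resulting dictionary is what gets serialized as the JSON body of POST /v1/chat/completions.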
When stream: true, the response uses Server-Sent Events. The wire format and parsing rules — including the optional usage chunk emitted when stream_options.include_usage: true is set — are documented in detail in SSE streaming protocol.
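For orientation, a rough sketch of consuming such a stream. It assumes OpenAI-style framing (one `data: <json>` line per event, a final `data: [DONE]` sentinel, text in `choices[].delta.content`); those details are assumptions here, and the SSE streaming protocol page is the authoritative description. The function names and the sample lines are illustrative only:

```python
import json

def iter_sse_chunks(lines):
    """Yield decoded JSON chunks from an iterable of SSE text lines.

    Assumes 'data: <json>' events terminated by 'data: [DONE]'.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

def collect_stream(lines):
    """Return (text, usage) accumulated from a stream of SSE lines."""
    parts, usage = [], None
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
        if chunk.get("usage"):
            usage = chunk["usage"]  # only present with include_usage
    return "".join(parts), usage

# Canned sample lines, not real server output.
sample = [
    'data: {"choices": [{"delta": {"content": "2 + 2 "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "equals 4."}}]}',
    "",
    'data: {"choices": [], "usage": {"total_tokens": 12}}',
    "",
    "data: [DONE]",
]
text, usage = collect_stream(sample)
print(text, usage)
```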
content — generated text. For reasoning-capable models like Qwen3-32B this is clean answer text by default; opt in to chain-of-thought via chat_template_kwargs.enable_thinking=true and the response will include raw <think>...</think> tags inline (see Reasoning content and <think> tags below).
reasoning_content — chain-of-thought reasoning (if supported by the model)
Reasoning content and <think> tags
Qwen/Qwen3-32B is a reasoning-capable model. By default, Shroud disables chain-of-thought generation for this model so /v1/chat/completions returns clean answer text in content with no <think>...</think> tags. This is enforced inside the TEE proxy on every request — both the OpenAI REST path and the encrypted SDK WebSocket path — so callers see consistent behavior regardless of transport.
Opting in to chain-of-thought
Re-enable thinking with the canonical OpenAI client extension extra_body.chat_template_kwargs.enable_thinking=true. This works with the stock openai-python and openai-node SDKs:
from openai import OpenAI

client = OpenAI(api_key="shroud_prod_...", base_url="https://shroud.us/v1")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(resp.choices[0].message.content)
# "<think>\nOkay, so the user is asking ...\n</think>\n\n2 + 2 equals **4**."
Direct HTTP callers add the field at the top level of the request body alongside messages:
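A minimal sketch of that direct request using only the Python standard library. The base URL comes from the client example above; the function name `build_thinking_request` is hypothetical, and the request is only constructed here, not sent:

```python
import json
import urllib.request

def build_thinking_request(api_key, base_url="https://shroud.us/v1"):
    """Build a POST request that opts in to chain-of-thought."""
    body = {
        "model": "Qwen/Qwen3-32B",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        # Top-level field, next to "messages" (no extra_body wrapper on the wire).
        "chat_template_kwargs": {"enable_thinking": True},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_thinking_request("shroud_prod_...")
print(req.get_method(), req.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(req)
```

The `extra_body` argument in openai-python exists only on the client side; its contents are merged into the JSON body, which is why the field sits at the top level here.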
When opt-in is active, the content field contains raw <think>...</think> tags inline with the answer; the reasoning is not split out into a separate reasoning_content field. This is a known limitation: separating the two requires --reasoning-parser qwen3 to be enabled on the worker, which is planned as a follow-up. Strip the tags client-side if you want only the answer.
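One way to strip the tags client-side, sketched with a regular expression. The helper name `strip_think` is hypothetical, and the sketch assumes each `<think>` block is properly closed:

```python
import re

# Non-greedy match so several <think> blocks are removed independently;
# DOTALL lets "." span the newlines inside the reasoning text.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(content):
    """Return content with any <think>...</think> blocks removed."""
    return _THINK_RE.sub("", content).strip()

raw = "<think>\nOkay, so the user is asking ...\n</think>\n\n2 + 2 equals **4**."
print(strip_think(raw))
```

When streaming, a tag can be split across chunks, so it is safer to apply this to the fully assembled content than to individual deltas.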
Operators can disable the platform-wide default-injection by setting the COCOON_DEFAULT_DISABLE_THINKING_MODELS env var on the TEE proxy — see Default-disable-thinking allowlist for details, including how to broaden the allowlist to other Qwen3 variants.
Available Models
The list of models served at any given time depends on which weights the inference fleet is running. Query GET /v1/models for the live list — only models reported by the inference upstream and active in the catalog are returned.
| Model | Parameters | Context | Notes |
| --- | --- | --- | --- |
| Qwen/Qwen3-32B | 32B | 131,072 tokens | General-purpose reasoning, code generation, multi-step tool orchestration. By default returns clean content without <think> tags; opt in to chain-of-thought via extra_body.chat_template_kwargs.enable_thinking=true. |
The unprefixed /v1/models returns a union across every enabled Cocoon network, so the same model id may appear once per network with a distinct owned_by value:
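To make the duplication concrete, a small sketch that groups the entries by model id. It assumes the response uses the OpenAI list envelope (`{"object": "list", "data": [...]}`); the `owned_by` values `network-a` and `network-b` are placeholders, not real network identifiers, and the function name is hypothetical:

```python
from collections import defaultdict

def networks_by_model(models_response):
    """Map each model id to the owned_by values it appears under."""
    grouped = defaultdict(list)
    for entry in models_response.get("data", []):
        grouped[entry["id"]].append(entry["owned_by"])
    return dict(grouped)

# Illustrative payload only; owned_by values are placeholders.
sample = {
    "object": "list",
    "data": [
        {"id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "network-a"},
        {"id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "network-b"},
    ],
}
print(networks_by_model(sample))
```

Clients that key a lookup on `id` alone would silently collapse these entries, so keying on the `(id, owned_by)` pair is the safer choice against the unprefixed endpoint.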
To list models for a single network, use the network-prefixed form (/cocoon-classic/v1/models or /cocoon-alpha/v1/models). The same prefix works for /v1/chat/completions to pin chat traffic to a specific network. See Cocoon networks for the full route grid and SDK pinning patterns.
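A sketch of how a client might derive the base URL for a pinned network. The host is taken from the client example earlier on this page, the helper name `base_url_for` is hypothetical, and the Cocoon networks page remains the reference for the supported patterns:

```python
KNOWN_NETWORKS = ("cocoon-classic", "cocoon-alpha")

def base_url_for(network=None, host="https://shroud.us"):
    """Return an OpenAI-client base URL, optionally pinned to one network."""
    if network is None:
        return f"{host}/v1"  # unprefixed routes, no network pinning
    if network not in KNOWN_NETWORKS:
        raise ValueError(f"unknown network: {network!r}")
    return f"{host}/{network}/v1"

print(base_url_for())
print(base_url_for("cocoon-alpha"))
```

The returned string can be passed as `base_url` when constructing the OpenAI client, so that both `/models` and `/chat/completions` calls go through the same prefix.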
The created field that OpenAI's spec includes is intentionally omitted: Cocoon doesn't have a meaningful creation timestamp for a registered model, so synthesizing one would mislead.
The most common statuses are summarised below. Auth-layer errors (401, and 429 from either the rate limiter or the CU limiter) are emitted by middleware that sits in front of this handler and use the Shroud canonical envelope; see Error reference.
| Status | When | Notes |
| --- | --- | --- |
| 400 | Missing or invalid model/messages, request body over 8 MB | OpenAI envelope. code is model_required, messages_required, model_not_found, or request_too_large. |
|  |  | OpenAI envelope. The response sets Retry-After: 5. Wait at least 5 seconds before retrying. The chat-completions handler does not currently emit this status; it is specific to model-listing failures. |
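A minimal sketch of honoring the Retry-After header when listing models. The `fetch` callable, its `(status, headers, body)` return shape, and the helper name are all hypothetical; the 503 used in the demo is a placeholder status chosen for illustration, not a value taken from this page:

```python
import time

def fetch_with_retry_after(fetch, max_attempts=3, default_wait=5.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out.

    fetch is a hypothetical callable returning (status, headers, body),
    where headers is a dict keyed exactly as "Retry-After".
    """
    status, body = None, None
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch()
        retry_after = headers.get("Retry-After")
        if status < 400 or retry_after is None or attempt == max_attempts:
            break
        try:
            wait = float(retry_after)
        except ValueError:
            wait = default_wait  # header was not a plain number of seconds
        sleep(max(wait, 0.0))
    return status, body

# Demo with canned responses; 503 is a placeholder status for illustration only.
_responses = iter([
    (503, {"Retry-After": "5"}, None),
    (200, {}, {"object": "list", "data": []}),
])
waits = []
status, body = fetch_with_retry_after(lambda: next(_responses), sleep=waits.append)
print(status, waits)
```

Injecting `sleep` keeps the helper testable without real delays; in production the default `time.sleep` applies the wait the server asked for.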