OpenAI-Compatible API

Endpoints

Chat Completions

POST /v1/chat/completions
Authorization: Bearer shroud_prod_...
Content-Type: application/json

List Models

GET /v1/models
Authorization: Bearer shroud_prod_...

Request Format

The request format follows the OpenAI Chat Completions API:

{
  "model": "Qwen/Qwen3-32B",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain zero-knowledge proofs"}
  ],
  "max_tokens": 512,
  "stream": true
}

Parameters

Parameter	Type	Required	Description
`model`	string	Yes	Model ID (e.g., `Qwen/Qwen3-32B`)
`messages`	array	Yes	Array of chat messages
`max_tokens`	integer	No	Maximum tokens to generate
`stream`	boolean	No	Enable streaming responses (default: false)
`stream_options.include_usage`	boolean	No	When `stream` is `true`, ask for a final usage chunk before `[DONE]`. See SSE streaming protocol.
`chat_template_kwargs`	object	No	Pass-through hook for the model's chat template. Used to opt in to chain-of-thought (`{"enable_thinking": true}`). See Reasoning content.
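
As a rough illustration of how the parameters in the table combine, the sketch below assembles a request body and wraps it in an unsent `urllib` request. The helper name `build_chat_request` and its keyword arguments are hypothetical and not part of any Shroud SDK; the URL is the one used in the examples on this page.

```python
import json
import urllib.request

# Endpoint taken from the curl examples on this page.
API_URL = "https://shroud.us/v1/chat/completions"


def build_chat_request(api_key, messages, model="Qwen/Qwen3-32B",
                       max_tokens=None, stream=False, include_usage=False,
                       enable_thinking=None):
    """Return an unsent Request carrying the documented parameters.

    Hypothetical helper for illustration only.
    """
    body = {"model": model, "messages": messages}
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if stream:
        body["stream"] = True
        if include_usage:
            # Only applies when streaming, per the parameter table.
            body["stream_options"] = {"include_usage": True}
    if enable_thinking is not None:
        # Pass-through hook for the model's chat template.
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Optional fields are omitted from the body when unset, so server-side defaults apply.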

Message Format

Field	Type	Description
`role`	string	`system`, `user`, or `assistant`
`content`	string or array	Message content. See note on multimodal arrays below.

Response Format

Non-Streaming

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "Qwen/Qwen3-32B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Zero-knowledge proofs are..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 128,
    "total_tokens": 143
  }
}
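
A minimal sketch of reading the fields a caller usually needs from a non-streaming response, assuming the body has already been decoded into a `dict` with the shape shown above. The function name `summarize_completion` is hypothetical.

```python
def summarize_completion(response: dict) -> dict:
    """Extract answer text, finish reason, and token total from a response dict."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Sample mirrors the non-streaming response shown above.
sample = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "model": "Qwen/Qwen3-32B",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Zero-knowledge proofs are..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 128, "total_tokens": 143},
}
summary = summarize_completion(sample)
```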

Streaming (SSE)

When stream: true, the response uses Server-Sent Events. The wire format and parsing rules — including the optional usage chunk emitted when stream_options.include_usage: true is set — are documented in detail in SSE streaming protocol.

data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}

data: {"choices":[{"delta":{"content":"Zero"},"index":0}]}

data: {"choices":[{"delta":{"content":"-knowledge"},"index":0}]}

data: {"choices":[{"delta":{"content":" proofs"},"index":0}]}

data: {"choices":[{"delta":{"reasoning_content":"Let me think..."},"index":0}]}

data: [DONE]

The delta object may contain:

content: generated text. For reasoning-capable models such as Qwen/Qwen3-32B this is clean answer text by default. If you opt in to chain-of-thought with chat_template_kwargs.enable_thinking=true, content includes raw <think>...</think> tags inline (see Reasoning content and <think> tags below).
reasoning_content: chain-of-thought reasoning, if the model supports it. For Qwen/Qwen3-32B, reasoning currently arrives inline in content rather than in this field (see Reasoning content and <think> tags below).
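
The delta handling described above could be sketched as follows for callers reading the raw stream without an SDK. This is a simplified illustration, not the authoritative parsing rules (those live in SSE streaming protocol): it assumes each event is a single `data:` line, and it assumes the optional usage chunk carries a top-level `usage` object. The function name `accumulate_sse` is hypothetical.

```python
import json


def accumulate_sse(lines):
    """Fold SSE `data:` lines into (answer text, reasoning text, usage)."""
    content, reasoning, usage = [], [], None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # assumed shape of the optional usage chunk
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            content.append(delta.get("content") or "")
            reasoning.append(delta.get("reasoning_content") or "")
    return "".join(content), "".join(reasoning), usage
```

Keeping `content` and `reasoning_content` in separate buffers lets the caller display the answer without the reasoning, whichever order the deltas arrive in.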

Reasoning content and `<think>` tags

Qwen/Qwen3-32B is a reasoning-capable model. By default, Shroud disables chain-of-thought generation for this model so /v1/chat/completions returns clean answer text in content with no <think>...</think> tags. This is enforced inside the TEE proxy on every request — both the OpenAI REST path and the encrypted SDK WebSocket path — so callers see consistent behavior regardless of transport.

Opting in to chain-of-thought

Re-enable thinking by passing chat_template_kwargs.enable_thinking=true through the OpenAI client's extra_body extension. This works with the stock openai Python and Node SDKs:

from openai import OpenAI

client = OpenAI(api_key="shroud_prod_...", base_url="https://shroud.us/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(resp.choices[0].message.content)
# "<think>\nOkay, so the user is asking ...\n</think>\n\n2 + 2 equals **4**."

Direct HTTP callers add the field at the top level of the request body alongside messages:

curl -X POST https://shroud.us/v1/chat/completions \
  -H "Authorization: Bearer shroud_prod_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "chat_template_kwargs": {"enable_thinking": true}
  }'

When opt-in is active, the content field contains raw <think>...</think> tags inline with the answer; the reasoning is not split out into a separate reasoning_content field. This is a known limitation: separating the two requires enabling --reasoning-parser qwen3 on the worker, which is planned as a follow-up. Strip the tags client-side if you want only the answer:

const answer = content.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
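
For Python callers, an equivalent of the one-liner above might look like this. The helper name `strip_think` is hypothetical. Like the JavaScript version, it removes only complete `<think>...</think>` pairs, so an unclosed tag (for example, output truncated by `max_tokens`) is left untouched.

```python
import re

# Non-greedy match so multiple think blocks are removed independently.
_THINK_BLOCK = re.compile(r"<think>[\s\S]*?</think>")


def strip_think(content: str) -> str:
    """Remove complete <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_BLOCK.sub("", content).strip()
```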

Operator override

Operators can disable the platform-wide default-injection by setting the COCOON_DEFAULT_DISABLE_THINKING_MODELS env var on the TEE proxy — see Default-disable-thinking allowlist for details, including how to broaden the allowlist to other Qwen3 variants.

Available Models

The list of models served at any given time depends on which weights the inference fleet is running. Query GET /v1/models for the live list — only models reported by the inference upstream and active in the catalog are returned.

Model	Parameters	Context	Notes
`Qwen/Qwen3-32B`	32B	131,072 tokens	General-purpose reasoning, code generation, multi-step tool orchestration. By default returns clean content without `<think>` tags; opt in to chain-of-thought via `extra_body.chat_template_kwargs.enable_thinking=true`.

List Models

curl https://shroud.us/v1/models \
  -H "Authorization: Bearer shroud_prod_..."

The unprefixed /v1/models returns a union across every enabled Cocoon network, so the same model id may appear once per network with a distinct owned_by value:

{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen3-32B",
      "object": "model",
      "owned_by": "cocoon-classic"
    },
    {
      "id": "Qwen/Qwen3-32B",
      "object": "model",
      "owned_by": "cocoon-alpha"
    }
  ]
}
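
Because the same id can appear once per network, a client that wants one entry per model may need to group the listing. A minimal sketch, assuming the response has been decoded into a `dict` with the shape above; the function name `networks_by_model` is hypothetical.

```python
from collections import defaultdict


def networks_by_model(listing: dict) -> dict:
    """Map each model id to the owned_by values it appears under, in response order."""
    grouped = defaultdict(list)
    for entry in listing.get("data", []):
        grouped[entry["id"]].append(entry["owned_by"])
    return dict(grouped)


# Sample mirrors the /v1/models response shown above.
listing = {
    "object": "list",
    "data": [
        {"id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "cocoon-classic"},
        {"id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "cocoon-alpha"},
    ],
}
grouped = networks_by_model(listing)
```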

To list models for a single network, use the network-prefixed form (/cocoon-classic/v1/models or /cocoon-alpha/v1/models). The same prefix works for /v1/chat/completions to pin chat traffic to a specific network. See Cocoon networks for the full route grid and SDK pinning patterns.
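
One way to express the prefixed routes in client code is to derive the base URL from the network name. The helper below is a hypothetical sketch; whether the result can be passed directly as `base_url` to an OpenAI client is an assumption based on the route shapes above, and the Cocoon networks page remains the authoritative source for pinning patterns.

```python
def network_base_url(network=None, host="https://shroud.us"):
    """Return the API base URL, optionally pinned to one Cocoon network."""
    prefix = f"/{network}" if network else ""
    return f"{host}{prefix}/v1"


pinned = network_base_url("cocoon-alpha")   # base for /cocoon-alpha/v1/... routes
unpinned = network_base_url()               # base for the unprefixed /v1/... routes
```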

The created field that OpenAI's spec includes is intentionally omitted: Cocoon doesn't have a meaningful creation timestamp for a registered model, so synthesizing one would mislead.

Examples

Using curl

curl -X POST https://shroud.us/v1/chat/completions \
  -H "Authorization: Bearer shroud_prod_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256,
    "stream": true
  }'

Using Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="shroud_prod_...",
    base_url="https://shroud.us/v1"
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    stream=True
)

for chunk in response:
    # Guard against chunks that carry no choices (e.g. a final usage chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Using TypeScript (OpenAI SDK)

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'shroud_prod_...',
  baseURL: 'https://shroud.us/v1',
});

const stream = await client.chat.completions.create({
  model: 'Qwen/Qwen3-32B',
  messages: [{ role: 'user', content: 'Hello!' }],
  max_tokens: 256,
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}

Errors

Errors emitted by the /v1/chat/completions and /v1/models handlers use the OpenAI error envelope, not the SHROUD canonical envelope:

{
  "error": {
    "message": "<human readable>",
    "type": "<server_error | invalid_request_error | ...>",
    "code": "<short code>"
  }
}

The most common statuses are summarised below. Auth-layer errors (401, 429 from the rate limiter, 429 from the CU limiter) are emitted by middleware that sits in front of this handler and use the SHROUD canonical envelope — see Error reference.

Status	When	Notes
`400`	Missing or invalid `model`/`messages`, request body over 8 MB	OpenAI envelope. `code` is `model_required`, `messages_required`, `model_not_found`, or `request_too_large`.
`502`	Upstream worker connect or stream error	OpenAI envelope. `type: server_error`, `code: upstream_error`.
`503`	`/v1/models` upstream-unavailable fallback	OpenAI envelope. The response sets `Retry-After: 5`. Wait at least 5 seconds before retrying. The chat-completions handler does not currently emit this status; it is specific to model-listing failures.

Mid-stream errors after data: events have started follow the OpenAI mid-stream-error shape — see SSE streaming protocol — Mid-stream error events.
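
A client-side handler for the statuses above might be sketched as follows. Both function names are hypothetical. The sketch treats only `503` as retryable because that is the only status documented here with a `Retry-After` header; retry policy for other statuses is left to the caller.

```python
import json


def parse_openai_error(body: str) -> dict:
    """Pull message, type, and code out of an OpenAI-style error envelope."""
    err = json.loads(body).get("error", {})
    return {
        "message": err.get("message"),
        "type": err.get("type"),
        "code": err.get("code"),
    }


def retry_delay_seconds(status: int, headers: dict, default: float = 5.0):
    """Return seconds to wait before retrying a 503, or None for other statuses."""
    if status != 503:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        return max(float(lowered.get("retry-after")), 0.0)
    except (TypeError, ValueError):
        # Header missing or not a plain number of seconds: use the documented 5 s.
        return default
```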

Differences from Cocoon SDK

Feature	OpenAI-Compatible API	Cocoon SDK
Encryption	Transport-level (TLS)	End-to-end (AES-256-GCM)
TEE attestation	Not available	Full TDX quote verification
Selective disclosure	Not available	Configurable
Protocol	HTTP/SSE	WebSocket
Client libraries	Any OpenAI-compatible client	Shroud SDK (Go/TypeScript)
