SSE streaming protocol

Shroud streams chat completions on the OpenAI-compatible HTTP path (POST /v1/chat/completions with stream: true) using Server-Sent Events in OpenAI's exact wire format. This page describes what arrives on the wire byte-by-byte so a custom client or non-OpenAI HTTP library can consume it correctly.

For end-to-end-encrypted streaming over WebSocket, see the Cocoon SDKs and the Wire protocol. The Cocoon SDK uses a different transport entirely; this page is only about the HTTP SSE path.

Request

    POST /v1/chat/completions
    Authorization: Bearer shroud_prod_...
    Content-Type: application/json

    {
      "model": "Qwen/Qwen3-32B",
      "messages": [{"role": "user", "content": "Hello!"}],
      "stream": true,
      "stream_options": { "include_usage": true }
    }

stream_options.include_usage: true is the OpenAI extension that asks the server for a usage chunk before the terminator. Shroud honours it on this endpoint.

Response headers

    Content-Type: text/event-stream; charset=utf-8
    Cache-Control: no-cache
    Connection: keep-alive

The gateway does not emit SSE comment-line keepalives (: ping\n\n). Idle gaps within a stream are short — the proxy forwards model output as it arrives — but if you place an intermediary that buffers SSE, set X-Accel-Buffering: no (nginx) or the equivalent on your reverse proxy to disable buffering.

Frame format

Every event is a single line beginning with data: and terminated by two newlines (\n\n). The payload is one JSON object; there is no event: field, no id:, and no multi-line data chunks.

    data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1715000000,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{"role":"assistant"}}]}

    data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1715000000,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{"content":"Hello"}}]}

    data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1715000000,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{"content":" there"}}]}

    data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1715000000,"model":"Qwen/Qwen3-32B","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

    data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1715000000,"model":"Qwen/Qwen3-32B","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":8,"total_tokens":20}}

    data: [DONE]

The expected sequence of chunks:

| Position | Shape | Notes |
| --- | --- | --- |
| First chunk | delta: { role: "assistant" } | The role is sent once. Subsequent deltas omit it. |
| Content chunks | delta: { content: "..." } (and optionally reasoning_content) | One per emitted token group; chunk size is determined upstream. |
| Final chunk | delta: {} with finish_reason: "stop" (or length, etc.) | finish_reason arrives only on the final chunk. The delta is empty. |
| Usage chunk (if requested) | choices: [], usage: { prompt_tokens, completion_tokens, total_tokens } | Only sent when the request set stream_options.include_usage: true. |
| Terminator | data: [DONE] (literal, not JSON) | Always last. After this no more bytes follow on this connection. |
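
The chunk positions described above can be told apart from the parsed JSON alone. The following is a minimal sketch, not part of any Shroud SDK: the function name classify_chunk and the returned labels are hypothetical, and it assumes a single choice per chunk, as in the examples on this page.

```python
def classify_chunk(chunk: dict) -> str:
    """Label a parsed chunk by its position in the expected sequence.

    The labels ("role", "content", "final", "usage", "error", "unknown")
    are illustrative names, not protocol values.
    """
    if "error" in chunk:
        return "error"
    choices = chunk.get("choices") or []
    if not choices:
        # The usage chunk carries an empty choices list plus a usage object.
        return "usage" if chunk.get("usage") else "unknown"
    choice = choices[0]
    delta = choice.get("delta") or {}
    if choice.get("finish_reason"):
        return "final"
    if "content" in delta or "reasoning_content" in delta:
        return "content"
    if delta.get("role"):
        return "role"
    return "unknown"
```

A client could use labels like these to assert ordering in tests, for example that "final" is seen before "usage".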

Reasoning content

For reasoning-capable models that produce chain-of-thought output when chat_template_kwargs.enable_thinking is set, an optional reasoning_content field appears on the delta object alongside or in place of content. Treat it as a parallel stream you may render or hide. See OpenAI-compatible API — Reasoning content for the opt-in mechanism.
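
One way to treat reasoning_content as a parallel stream is to accumulate it in its own buffer. This is a sketch under assumptions: the Transcript class and its attribute names are hypothetical, and it reads only the first choice of each chunk.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Collects answer text and reasoning text in separate buffers."""

    content_parts: list = field(default_factory=list)
    reasoning_parts: list = field(default_factory=list)

    def feed(self, chunk: dict) -> None:
        choices = chunk.get("choices") or []
        if not choices:
            return  # usage chunk or other chunk without a delta
        delta = choices[0].get("delta") or {}
        # A delta may carry either field, both, or neither.
        if delta.get("content"):
            self.content_parts.append(delta["content"])
        if delta.get("reasoning_content"):
            self.reasoning_parts.append(delta["reasoning_content"])

    @property
    def answer(self) -> str:
        return "".join(self.content_parts)

    @property
    def thinking(self) -> str:
        return "".join(self.reasoning_parts)
```

Keeping the two buffers apart lets the caller decide later whether to render the reasoning text, collapse it, or drop it.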

Mid-stream error events

If inference fails after streaming has started — upstream worker crash, timeout, model error — the gateway emits an error event in the OpenAI mid-stream-error shape and then closes the stream:

    data: {"error":{"message":"upstream worker disconnected","type":"upstream_error","code":"inference_failed"}}

    data: [DONE]

Clients that ignore the error field will see a truncated response followed by [DONE] with no finish_reason: "stop". To detect this:

  • Treat any chunk whose top-level shape is {"error": {...}} as a failure regardless of the surrounding chunks.

  • After [DONE], verify you saw at least one chunk with choices[0].finish_reason set. Absent it, treat the response as errored.
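
The two checks above can be sketched as a small tracker that sees every parsed chunk and is consulted once after [DONE]. The names StreamFailed and CompletionCheck are hypothetical, chosen for illustration only.

```python
class StreamFailed(Exception):
    """Raised when a stream reports an error or ends without finishing."""


class CompletionCheck:
    def __init__(self) -> None:
        self.finish_reason = None

    def observe(self, chunk: dict) -> None:
        """Call for every parsed chunk, in arrival order."""
        err = chunk.get("error")
        if err is not None:
            # Any top-level error object is a failure, whatever surrounds it.
            message = err.get("message", "unknown error") if isinstance(err, dict) else str(err)
            raise StreamFailed(message)
        choices = chunk.get("choices") or []
        if choices and choices[0].get("finish_reason"):
            self.finish_reason = choices[0]["finish_reason"]

    def finish(self) -> str:
        """Call once after [DONE]; fails if no finish_reason was ever seen."""
        if self.finish_reason is None:
            raise StreamFailed("stream ended without a finish_reason")
        return self.finish_reason
```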

Parsing rules

A correct SSE parser for Shroud only needs three rules:

  1. Buffer until \n\n. Each event ends on the blank line. Don't try to parse partial chunks; wait for the boundary.

  2. Split on the first :. Take the suffix and drop the single space that follows the colon; if what remains is the literal [DONE], the stream is over. Otherwise JSON.parse it.

  3. Stop on [DONE]. Don't read past it. Close the response.

Most language-standard SSE libraries do all three; the OpenAI SDKs do too. You only need to implement this yourself when you're using a raw HTTP client.
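
For a raw HTTP client that only hands back byte chunks, the three rules can be sketched as a generator over those chunks. The function name iter_chunks is hypothetical, and the sketch relies on the single-line data: frames described on this page rather than the full SSE grammar.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(byte_stream: Iterable[bytes]) -> Iterator[dict]:
    """Yield parsed JSON chunks from raw reads until [DONE] is seen."""
    buf = b""
    for data in byte_stream:
        buf += data
        while True:
            # Rule 1: only act on complete events, delimited by a blank line.
            event, sep, rest = buf.partition(b"\n\n")
            if not sep:
                break  # no full event buffered yet; wait for more bytes
            buf = rest
            # Rule 2: split on the first colon and drop one leading space.
            name, colon, value = event.decode("utf-8").partition(":")
            if name != "data" or not colon:
                continue
            payload = value[1:] if value.startswith(" ") else value
            # Rule 3: stop at the terminator and read nothing after it.
            if payload == "[DONE]":
                return
            yield json.loads(payload)
```

Decoding each complete event, rather than each read, also sidesteps split multi-byte characters, since a UTF-8 sequence never contains the newline byte used as the boundary.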

Minimal Python parser

    import json

    import requests

    resp = requests.post(
        "https://shroud.us/v1/chat/completions",
        headers={"Authorization": "Bearer shroud_prod_..."},
        json={
            "model": "Qwen/Qwen3-32B",
            "messages": [{"role": "user", "content": "Hello!"}],
            "stream": True,  # Python boolean; serialized as JSON true
        },
        stream=True,  # have requests hand back the body incrementally
    )

    for line in resp.iter_lines(decode_unicode=True):
        # Skip blank separator lines and anything that is not a data field.
        if not line or not line.startswith("data: "):
            continue
        payload = line.removeprefix("data: ")  # requires Python 3.9+
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if "error" in chunk:
            raise RuntimeError(chunk["error"]["message"])
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a usage chunk, which has an empty choices list
        delta = choices[0]["delta"]
        if "content" in delta:
            print(delta["content"], end="", flush=True)

Minimal Node parser

    const resp = await fetch("https://shroud.us/v1/chat/completions", {
      method: "POST",
      headers: {
        "Authorization": "Bearer shroud_prod_...",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "Qwen/Qwen3-32B",
        messages: [{ role: "user", content: "Hello!" }],
        stream: true,
      }),
    });

    if (!resp.body) throw new Error("response has no body");
    const reader = resp.body.getReader();
    const decoder = new TextDecoder();
    let buf = "";

    outer: while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      // stream: true keeps a partial multi-byte character for the next read.
      buf += decoder.decode(value, { stream: true });

      let idx;
      // Process every complete event (terminated by a blank line) in the buffer.
      while ((idx = buf.indexOf("\n\n")) !== -1) {
        const event = buf.slice(0, idx);
        buf = buf.slice(idx + 2);
        if (!event.startsWith("data: ")) continue;
        const payload = event.slice(6);
        if (payload === "[DONE]") break outer;
        const chunk = JSON.parse(payload);
        if (chunk.error) throw new Error(chunk.error.message);
        process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
      }
    }

    // Release the connection once the terminator has been seen.
    await reader.cancel();

Cancellation

Cancel a stream by closing the underlying TCP connection — close the HTTP response, abort the fetch, or cancel the request context. The gateway notices the broken pipe on its next write and tears down the upstream inference task. There is no protocol-level cancel frame.

Stock OpenAI clients map this to:

  • Python openai: response.close(), or leaving the scope of with client.chat.completions.stream(...) as stream:.

  • Node openai: break out of the for await ... of loop, or call controller.abort() on the AbortController passed in signal: ....

  • langchain-openai: astream cancellation propagates through asyncio.CancelledError.

Clean cancellation will still bill for tokens already generated; the TEE-reported usage is captured up to the cancel point.
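
With a raw HTTP client, cancelling amounts to closing the response as soon as you stop reading. The sketch below assumes a response object that exposes iter_lines() and close(), as a requests response opened with stream=True does; the helper name read_until and its max_lines parameter are hypothetical.

```python
from contextlib import closing


def read_until(resp, max_lines: int) -> list:
    """Collect up to max_lines non-empty lines, then close the response.

    closing() guarantees resp.close() runs on early exit or on an
    exception, which is what signals cancellation to the server.
    """
    lines = []
    with closing(resp):
        for line in resp.iter_lines(decode_unicode=True):
            if not line:
                continue  # blank separator between events
            lines.append(line)
            if len(lines) >= max_lines:
                break  # stop early; the with-block closes the connection
    return lines
```

The same pattern applies to any stop condition, such as a user pressing a stop button or a length budget being reached.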

Reconnection

There is no protocol-level resume. SSE's Last-Event-ID is not honoured on this endpoint — the gateway does not assign event ids, and the upstream inference call is one-shot per request. If a streaming response drops mid-flight:

  1. Treat any partial output you've collected as untrusted and discard it.

  2. Reissue the request with the same messages and a new request_id if you're correlating on your side.

  3. Honour Retry-After if the failure was a 503/429 — see Production guide — Retry & backoff.

If your application needs at-most-once-emitted-token semantics, implement an application-level dedupe (idempotency key + cached response) at your call site. Inference paths don't accept the Idempotency-Key header; see Production guide — Idempotency.
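
The reconnection steps above can be sketched as a retry loop that starts every attempt with an empty buffer. Everything named here is hypothetical: RetryableError stands in for whatever your HTTP layer raises on a dropped stream or a 503/429, start_stream is a caller-supplied function that issues the request and yields text pieces, and the exponential fallback delay is an arbitrary choice for when no Retry-After value is available.

```python
import time


class RetryableError(Exception):
    """Hypothetical error for a dropped stream or a 503/429 response."""

    def __init__(self, retry_after=None):
        super().__init__("retryable stream failure")
        self.retry_after = retry_after  # seconds from Retry-After, if any


def stream_with_retry(start_stream, max_attempts=3, sleep=time.sleep) -> str:
    """Run start_stream() until one attempt completes, then return its text."""
    for attempt in range(1, max_attempts + 1):
        parts = []  # fresh buffer: partial output from a failed attempt is discarded
        try:
            for text in start_stream():  # reissues the full request each time
                parts.append(text)
            return "".join(parts)
        except RetryableError as exc:
            if attempt == max_attempts:
                raise
            # Honour Retry-After when present; otherwise back off (assumed policy).
            delay = exc.retry_after if exc.retry_after is not None else 2 ** attempt
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```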

Common gotchas

  • A reverse proxy buffers the stream. By default nginx buffers upstream responses, so chunks arrive at the client in bursts. Set X-Accel-Buffering: no on the response (or proxy_buffering off in the nginx config block) to disable buffering. CDN edges typically need the same toggle.

  • HTTP/2 frames sized weirdly. Some HTTP/2 clients merge SSE events into a single frame and then surface them in a burst. This is a transport detail; the parsing rules above still hold.

  • Reading lines with \r\n line endings. Shroud terminates each event with \n\n (LF LF), not \r\n\r\n. A line reader configured for HTTP headers may swallow events. Use a byte reader and split on \n\n, or use a library that knows SSE.

  • Treating [DONE] as JSON. It is the literal string [DONE] prefixed by data: — not a JSON array. Special-case it before passing the payload to JSON.parse.

  • Forgetting to set stream: true. Without it the endpoint returns a single non-streaming JSON body. The Accept: text/event-stream request header, by contrast, is not required (although some HTTP clients enforce strict accept negotiation on their side): Shroud doesn't gate on the Accept header, so the response is still SSE without it.

  • Unicode boundary splits. A multi-byte UTF-8 codepoint can straddle two TCP reads; decode with a streaming UTF-8 decoder (Python's iter_lines(decode_unicode=True), or a JS TextDecoder called as decoder.decode(value, { stream: true })) rather than per-read bytes.decode("utf-8").
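
To illustrate the Unicode boundary problem from the list above, here is a small sketch using Python's standard incremental decoder. The helper name decode_reads is hypothetical; codecs.getincrementaldecoder is part of the standard library.

```python
import codecs
from typing import Iterable, Iterator


def decode_reads(reads: Iterable[bytes]) -> Iterator[str]:
    """Decode successive byte reads as UTF-8, tolerating split characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for data in reads:
        # Incomplete trailing bytes are held back until the next read.
        text = decoder.decode(data)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

By contrast, calling bytes.decode("utf-8") on each read separately raises UnicodeDecodeError whenever a read ends in the middle of a character.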

Last modified: 08 May 2026