SSE streaming protocol
Shroud streams chat completions on the OpenAI-compatible HTTP path (POST /v1/chat/completions with stream: true) using Server-Sent Events in OpenAI's exact wire format. This page describes what arrives on the wire byte-by-byte so a custom client or non-OpenAI HTTP library can consume it correctly.
For end-to-end-encrypted streaming over WebSocket, see the Cocoon SDKs and the Wire protocol. The Cocoon SDK uses a different transport entirely; this page is only about the HTTP SSE path.
Request
stream_options.include_usage: true is the OpenAI extension that asks the server for a usage chunk before the terminator. Shroud honours it on this endpoint.
Response headers
The gateway does not emit SSE comment-line keepalives (: ping\n\n). Idle gaps within a stream are short — the proxy forwards model output as it arrives — but if you place an intermediary that buffers SSE, set X-Accel-Buffering: no (nginx) or the equivalent on your reverse proxy to disable buffering.
Frame format
Every event is a single line beginning with data: and terminated by two newlines (\n\n). The payload is one JSON object; there is no event: field, no id:, and no multi-line data chunks.
The expected sequence of chunks:
Position | Shape | Notes |
|---|---|---|
First chunk |
| The role is sent once. Subsequent deltas omit it. |
Content chunks |
| One per emitted token group; chunk size is determined upstream. |
Final chunk |
|
|
Usage chunk (if requested) |
| Only sent when the request set |
Terminator |
| Always last. After this no more bytes follow on this connection. |
Reasoning content
For reasoning-capable models that produce chain-of-thought output when chat_template_kwargs.enable_thinking is set, an optional reasoning_content field appears on the delta object alongside or in place of content. Treat it as a parallel stream you may render or hide. See OpenAI-compatible API — Reasoning content for the opt-in mechanism.
Mid-stream error events
If inference fails after streaming has started — upstream worker crash, timeout, model error — the gateway emits an error event in the OpenAI mid-stream-error shape and then closes the stream:
Clients that ignore the error field will see a truncated response followed by [DONE] with no finish_reason: "stop". To detect this:
Treat any chunk whose top-level shape is
{"error": {...}}as a failure regardless of the surrounding chunks.After
[DONE], verify you saw at least one chunk withchoices[0].finish_reasonset. Absent it, treat the response as errored.
Parsing rules
A correct SSE parser for Shroud only needs three rules:
Buffer until
\n\n. Each event ends on the blank line. Don't try to parse partial chunks; wait for the boundary.Split on the first
:. Take the suffix; if it is the literal[DONE], the stream is over. OtherwiseJSON.parseit.Stop on
[DONE]. Don't read past it. Close the response.
Most language-standard SSE libraries do all three; the OpenAI SDKs do too. You only need to implement this yourself when you're using a raw HTTP client.
Minimal Python parser
Minimal Node parser
Cancellation
Cancel a stream by closing the underlying TCP connection — close the HTTP response, abort the fetch, or cancel the request context. The gateway notices the broken pipe on its next write and tears down the upstream inference task. There is no protocol-level cancel frame.
Stock OpenAI clients map this to:
Python
openai—response.close()orwith client.chat.completions.stream(...) as stream:'s scope exit.Node
openai—for await ... ofloop break, orcontroller.abort()on theAbortControllerpassed insignal: ....langchain-openai—astreamcancellation propagates throughasyncio.CancelledError.
Clean cancellation will still bill for tokens already generated; the TEE-reported usage is captured up to the cancel point.
Reconnection
There is no protocol-level resume. SSE's Last-Event-ID is not honoured on this endpoint — the gateway does not assign event ids, and the upstream inference call is one-shot per request. If a streaming response drops mid-flight:
Treat any partial output you've collected as untrusted and discard it.
Reissue the request with the same
messagesand a newrequest_idif you're correlating on your side.Honour
Retry-Afterif the failure was a503/429— see Production guide — Retry & backoff.
If your application needs at-most-once-emitted-token semantics, implement an application-level dedupe (idempotency key + cached response) at your call site. Inference paths don't accept the Idempotency-Key header; see Production guide — Idempotency.
Common gotchas
A reverse proxy buffers the stream. nginx defaults buffer upstream responses; chunks arrive in bursts at the client. Set
X-Accel-Buffering: noon the response (orproxy_buffering offin the nginx config block) to disable. CDN edges typically need the same toggle.HTTP/2 frames sized weirdly. Some HTTP/2 clients merge SSE events into a single frame and then surface them in a burst. This is a transport detail; the parsing rules above still hold.
Reading lines with
\r\nline endings. SSE is\n\n(LF LF), not\r\n\r\n. A line-reader configured for HTTP headers may swallow events. Use a byte reader and split on\n\nor use a library that knows SSE.Treating
[DONE]as JSON. It is the literal string[DONE]prefixed bydata:— not a JSON array. Special-case it before passing the payload toJSON.parse.Forgetting to set
stream: true. Without it the endpoint returns a single non-streaming JSON body. With it but missing theAccept: text/event-streamrequest header (some HTTP clients enforce strict accept negotiation), the response will still be SSE — Shroud doesn't gate on theAcceptheader.Unicode boundary splits. A multi-byte UTF-8 codepoint can straddle two TCP reads; decode with a streaming UTF-8 decoder (Python's
iter_lines(decode_unicode=True), JSTextDecoder({ stream: true })) rather than per-readbytes.decode("utf-8").
Related
OpenAI-compatible API — full request/response reference.
Migrate from OpenAI — drop-in client setup.
Production guide — Retry & backoff — when and how to reissue a dropped stream.
Cocoon SDK (Go)· Cocoon SDK (TypeScript) — the end-to-end-encrypted streaming alternative over WebSocket.
Wire protocol — Cocoon WebSocket binary framing.