Cocoon Go SDK
Overview
The Go SDK provides a high-level client for Cocoon confidential inference. It performs an X25519 key exchange with the in-TEE proxy, encrypts every prompt and response chunk with AES-GCM under the session key, verifies the proxy's TDX attestation against a built-in allow-list of measurements, and surfaces TEE-signed token usage at the end of each request. Streaming responses arrive as decrypted Chunk values; the Usage struct returned by stream.Usage() carries an Ed25519 signature the SDK validates locally.
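The key exchange and per-chunk encryption described above can be sketched with the Go standard library alone. This is an illustration of the general X25519 plus AES-GCM pattern, not the SDK's wire format: the SHA-256 key-derivation step, the nonce-prefixed framing, and every function name below (`deriveKey`, `seal`, `open`, `roundTrip`) are assumptions made for the example.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/ecdh"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// deriveKey hashes the raw X25519 shared secret into a 32-byte AES key.
// Assumption: the SDK's real key-derivation function is not documented here.
func deriveKey(priv *ecdh.PrivateKey, peer *ecdh.PublicKey) ([]byte, error) {
	secret, err := priv.ECDH(peer)
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(secret)
	return sum[:], nil
}

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext and prepends the random nonce (hypothetical framing).
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open splits off the nonce and authenticates and decrypts the rest.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

// roundTrip plays both sides: a client key pair and a stand-in proxy key pair.
func roundTrip(msg []byte) ([]byte, error) {
	client, err := ecdh.X25519().GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	proxy, err := ecdh.X25519().GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	clientKey, err := deriveKey(client, proxy.PublicKey())
	if err != nil {
		return nil, err
	}
	proxyKey, err := deriveKey(proxy, client.PublicKey())
	if err != nil {
		return nil, err
	}
	sealed, err := seal(clientKey, msg)
	if err != nil {
		return nil, err
	}
	return open(proxyKey, sealed)
}

func main() {
	out, err := roundTrip([]byte("hello from the client"))
	fmt.Println(string(out), err)
}
```

Both sides derive the same key from their own private key and the peer's public key, which is why the stand-in proxy can open what the client sealed.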
The SDK has no retry behaviour and no client-side rate limiting. Callers are expected to wrap calls in their own backoff. See Retry policy below.
Installation
Migration from OpenAI
The SDK does not expose an OpenAI-compatible surface. If you have existing code talking to api.openai.com/v1/chat/completions and want to keep that shape, point your OpenAI client at the Shroud gateway's /v1/chat/completions endpoint instead — the gateway accepts the OpenAI request body and returns OpenAI responses. See Migrate from OpenAI and the OpenAI-compatible API reference.
Use the Cocoon Go SDK when you need end-to-end encryption between your process and the TEE, attestation verification, and TEE-signed usage receipts. The HTTP path terminates TLS at the gateway and does not provide any of those properties.
Quick start
Listing models
The Model struct exposes only {ID, Object, OwnedBy}. Live worker counts and per-model coefficient ranges are not part of Model; fetch them via client.ListWorkers(ctx) — see Speed tiers and coefficients.
Listing live workers
ListWorkers returns []WorkerType. Each entry has a model Name and a slice of WorkerInstance values exposing Coefficient, ActiveRequests, and MaxActiveRequests.
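As a rough sketch of how the returned data might be consumed, the example below sums the free request slots for one worker type. The struct definitions are local stand-ins that mirror the field names listed above; the `Instances` field name, the integer field types, and the `freeSlots` helper are assumptions, not the SDK's declarations.

```go
package main

import "fmt"

// WorkerInstance mirrors the fields named in the docs; the types are assumed.
type WorkerInstance struct {
	Coefficient       int
	ActiveRequests    int
	MaxActiveRequests int
}

// WorkerType is a stand-in; the slice field name "Instances" is hypothetical.
type WorkerType struct {
	Name      string
	Instances []WorkerInstance
}

// freeSlots counts how many more requests the instances could accept.
func freeSlots(w WorkerType) int {
	free := 0
	for _, in := range w.Instances {
		if n := in.MaxActiveRequests - in.ActiveRequests; n > 0 {
			free += n
		}
	}
	return free
}

func main() {
	w := WorkerType{
		Name: "example-model",
		Instances: []WorkerInstance{
			{Coefficient: 100, ActiveRequests: 2, MaxActiveRequests: 4},
			{Coefficient: 200, ActiveRequests: 4, MaxActiveRequests: 4},
		},
	}
	fmt.Printf("%s: %d free slots\n", w.Name, freeSlots(w))
}
```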
Selecting a Cocoon network
The default paths follow the deployment's default_network. To pin the client to a specific network — for example cocoon-classic regardless of the deployment default — override every Cocoon-bound path with the network-prefixed form:
See Cocoon networks for the full route grid and the list of cases where pinning is the right call.
Streaming
Client.Inference always returns a *Stream. Iterate with Next(), then check Err() and read Usage() once iteration ends.
Always call stream.Err() after the loop. A network drop mid-stream ends iteration cleanly but surfaces only through Err(); without the check, a truncated response looks like a complete one.
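The loop-then-check pattern can be sketched against a fake stream. The `chunkStream` interface, the `Next() (Chunk, bool)` signature, and the `Content` field are assumptions made so the example compiles on its own; the SDK's actual `Next` signature may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Chunk is a stand-in for the SDK's decrypted chunk; Content is an assumed field.
type Chunk struct{ Content string }

// chunkStream is a hypothetical interface covering the two calls used here.
type chunkStream interface {
	Next() (Chunk, bool)
	Err() error
}

// fakeStream yields its chunks and simulates a network drop at index failAt
// (use -1 for a stream that completes normally).
type fakeStream struct {
	chunks []Chunk
	failAt int
	pos    int
	err    error
}

func (s *fakeStream) Next() (Chunk, bool) {
	if s.err != nil || s.pos >= len(s.chunks) && s.pos != s.failAt {
		return Chunk{}, false
	}
	if s.pos == s.failAt {
		s.err = errors.New("simulated network drop")
		return Chunk{}, false
	}
	c := s.chunks[s.pos]
	s.pos++
	return c, true
}

func (s *fakeStream) Err() error { return s.err }

// collect drains the stream, then checks Err() so that a truncated response
// is reported as an error instead of being mistaken for a complete one.
func collect(s chunkStream) (string, error) {
	var b strings.Builder
	for {
		c, ok := s.Next()
		if !ok {
			break
		}
		b.WriteString(c.Content)
	}
	if err := s.Err(); err != nil {
		return b.String(), fmt.Errorf("stream truncated: %w", err)
	}
	return b.String(), nil
}

func main() {
	chunks := []Chunk{{Content: "Hello, "}, {Content: "world"}}
	fmt.Println(collect(&fakeStream{chunks: chunks, failAt: -1}))
	fmt.Println(collect(&fakeStream{chunks: chunks, failAt: 1}))
}
```

In the second call the loop exits exactly as it does for a finished stream; only the `Err()` check reveals that the text is partial.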
Selective disclosure
Control which usage fields the TEE reveals to the gateway. By default the TEE returns only token totals; opt in to per-request fields by listing them in Disclose.
Available fields:
Field | Description |

|---|---|
| Number of input tokens |
| Number of cached tokens |
| Number of output tokens |
| Number of reasoning tokens |
| Total token count |
| Model name |
| Proxy start timestamp |
| Proxy end timestamp |
| Worker start timestamp |
| Worker end timestamp |
| Worker debug info |
| Proxy debug info |
Speed tiers and coefficients
Each Cocoon worker advertises a coefficient: a relative-cost weight applied to the per-token CU price. Lower coefficients are slower but cheaper; higher coefficients are faster and more expensive. The SDK lets you bound or pin the coefficient either client-wide or per request.
Three named tiers map to live worker statistics for the requested model: TierBase resolves to the minimum coefficient, TierStandard to the median, TierPriority to the maximum. The SDK fetches /v1/cocoon/workers, computes the bucket from live data, and writes the resolved integer into the encrypted request.
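The min/median/max mapping can be sketched as a small pure function. The tier constants here are local stand-ins for the SDK's, and the choice of the upper median for an even number of workers is an assumption; the docs above do not say how ties are broken.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Tier is a local stand-in for the SDK's speed-tier type.
type Tier int

const (
	TierBase Tier = iota
	TierStandard
	TierPriority
)

// resolveTier maps a tier to a coefficient drawn from the live values.
// Assumption: with an even count, the upper of the two middle values is used.
func resolveTier(coefficients []int, t Tier) (int, error) {
	if len(coefficients) == 0 {
		return 0, errors.New("no live workers for model")
	}
	sorted := append([]int(nil), coefficients...) // copy so the input is not reordered
	sort.Ints(sorted)
	switch t {
	case TierBase:
		return sorted[0], nil
	case TierStandard:
		return sorted[len(sorted)/2], nil
	case TierPriority:
		return sorted[len(sorted)-1], nil
	default:
		return 0, fmt.Errorf("unknown tier %d", t)
	}
}

func main() {
	live := []int{300, 100, 200}
	for _, t := range []Tier{TierBase, TierStandard, TierPriority} {
		c, err := resolveTier(live, t)
		fmt.Println(t, c, err)
	}
}
```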
Precedence (highest first):
1. InferenceRequest.MaxCoefficient (explicit integer wins).
2. InferenceRequest.SpeedTier (resolved from live workers).
3. Client-level WithMaxCoefficient.
4. Client-level WithSpeedTier.
5. Absent (TEE applies its own fallback).
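The precedence order above can be expressed as a short decision function. The `settings` struct, the use of zero values to mean "unset", and the `resolve` callback are all assumptions for illustration; the SDK's internal representation is not shown in this document.

```go
package main

import "fmt"

// settings holds the two knobs at one level (request or client).
// Assumption: 0 and "" mean "not set".
type settings struct {
	MaxCoefficient int
	SpeedTier      string
}

// pickCoefficient walks the precedence list from highest to lowest.
// It returns set=false when nothing is configured, in which case the
// TEE applies its own fallback.
func pickCoefficient(req, client settings, resolve func(tier string) (int, error)) (coeff int, set bool, err error) {
	switch {
	case req.MaxCoefficient != 0:
		return req.MaxCoefficient, true, nil
	case req.SpeedTier != "":
		coeff, err = resolve(req.SpeedTier)
		return coeff, err == nil, err
	case client.MaxCoefficient != 0:
		return client.MaxCoefficient, true, nil
	case client.SpeedTier != "":
		coeff, err = resolve(client.SpeedTier)
		return coeff, err == nil, err
	default:
		return 0, false, nil
	}
}

func main() {
	// A stub resolver standing in for the live-worker lookup.
	resolve := func(tier string) (int, error) {
		if tier == "priority" {
			return 300, nil
		}
		return 100, nil
	}
	fmt.Println(pickCoefficient(settings{SpeedTier: "priority"}, settings{MaxCoefficient: 150}, resolve))
	fmt.Println(pickCoefficient(settings{}, settings{}, resolve))
}
```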
If /v1/cocoon/workers is unreachable when the SDK needs to resolve a tier, Inference returns an error before opening the WebSocket. See the Billing reference for the coefficient-to-CU formula.
Reasoning content / chat-template overrides
The Go SDK does not currently expose chat_template_kwargs. For reasoning-content opt-in (enable_thinking and similar vLLM/sglang knobs) use the OpenAI HTTP path or the TypeScript SDK, which exposes chatTemplateKwargs.
Attestation verification
By default, the SDK verifies Intel TDX attestation quotes against a built-in allow-list of Cocoon proxy image measurements. If the quote is missing, malformed, carries a measurement that is not on the allow-list, or has report_data that does not bind to the proxy public key the gateway returned, Inference returns an AttestationError-flavoured error and closes the WebSocket before any prompt is sent.
After the stream completes, stream.Usage().Verified reports whether the per-usage Ed25519 signature checked out against the session's TEE public key. The SDK does not currently fail closed when verification fails — the stream still returns content and Verified == false. Treat unverified usage as untrusted and reject the response in your own code:
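A minimal sketch of such a guard, together with what a local Ed25519 check looks like using the standard library. The `Usage` struct here is a stand-in in which only the `Verified` field comes from the description above; `TotalTokens`, `requireVerified`, `checkSignature`, and the signed payload format are assumptions.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
)

// Usage is a stand-in for the SDK's usage struct (assumed shape).
type Usage struct {
	TotalTokens int
	Verified    bool
}

// ErrUnverifiedUsage is a hypothetical sentinel for callers to branch on.
var ErrUnverifiedUsage = errors.New("usage signature not verified")

// requireVerified fails closed: missing or unverified usage is an error.
func requireVerified(u *Usage) error {
	if u == nil || !u.Verified {
		return ErrUnverifiedUsage
	}
	return nil
}

// checkSignature verifies sig over payload; it guards the key length
// because ed25519.Verify panics on a wrongly sized public key.
func checkSignature(pub ed25519.PublicKey, payload, sig []byte) bool {
	if len(pub) != ed25519.PublicKeySize {
		return false
	}
	return ed25519.Verify(pub, payload, sig)
}

// demo signs a payload with a throwaway key, then checks the original
// payload and a tampered one against the same signature.
func demo() (good, tampered bool, err error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false, false, err
	}
	payload := []byte(`{"total_tokens":42}`)
	sig := ed25519.Sign(priv, payload)
	good = checkSignature(pub, payload, sig)
	tampered = checkSignature(pub, []byte(`{"total_tokens":1}`), sig)
	return good, tampered, nil
}

func main() {
	good, tampered, err := demo()
	fmt.Println(good, tampered, err)
	fmt.Println(requireVerified(&Usage{TotalTokens: 42, Verified: false}))
}
```

Calling a guard like `requireVerified` before acting on the response or recording the token counts turns the SDK's permissive default into fail-closed behaviour in your own code.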
For the wire-level details see How attestation works.
Error handling
Inference returns descriptive errors wrapped through fmt.Errorf. Callers should branch on the error chain rather than string match. The session-setup phase fails with errors prefixed by session error, attestation verification failed, derive shared key, or websocket connect. During streaming, transport faults and TEE errors surface through stream.Err().
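Branching on the error chain generally means `errors.Is` and `errors.As` rather than inspecting messages. The sketch below shows the pattern with a hypothetical typed error and sentinel; whether the SDK exports an `AttestationError` type or any sentinel values in this form is an assumption, so check the package's exported errors before relying on it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// AttestationError is a hypothetical typed error used only in this sketch.
type AttestationError struct{ Reason string }

func (e *AttestationError) Error() string {
	return "attestation verification failed: " + e.Reason
}

// errTransport is a hypothetical sentinel for connection-level failures.
var errTransport = errors.New("websocket connect")

// classify inspects the wrapped chain instead of matching on strings.
func classify(err error) string {
	var att *AttestationError
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &att):
		return "attestation"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.Is(err, errTransport):
		return "transport"
	default:
		return "other"
	}
}

// simulateSetupFailure wraps a typed error the way fmt.Errorf with %w does.
func simulateSetupFailure() error {
	return fmt.Errorf("session error: %w", &AttestationError{Reason: "measurement not in allow-list"})
}

// simulateTimeout wraps a context deadline error under the transport sentinel's text.
func simulateTimeout() error {
	return fmt.Errorf("websocket connect: %w", context.DeadlineExceeded)
}

func main() {
	fmt.Println(classify(simulateSetupFailure()))
	fmt.Println(classify(simulateTimeout()))
	fmt.Println(classify(fmt.Errorf("dial: %w", errTransport)))
}
```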
For HTTP-status-coded gateway errors (auth, rate limits, CU limits) see Error reference.
Retry policy
The SDK does not retry. Each Inference call performs exactly one WebSocket dial and one session setup; ListModels and ListWorkers each issue a single HTTP request. Transient drops, 503 responses, and DNS hiccups all surface to the caller as errors with no retry attempt.
Wrap calls in your own backoff. See the Production guide for the recommended pattern (exponential backoff with jitter, capped retry budget, respect for Retry-After).
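As a starting point, here is a minimal sketch of exponential backoff with full jitter and a capped attempt budget. It is not the Production guide's pattern verbatim: the `retry` helper and its parameters are hypothetical, it retries every error without distinguishing permanent failures, and it omits Retry-After handling, which applies only when an HTTP response is available.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry calls op up to attempts times, sleeping a random duration in
// [0, delay] between tries and doubling delay up to maxDelay.
func retry(ctx context.Context, attempts int, base, maxDelay time.Duration, op func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	if base <= 0 {
		base = time.Millisecond
	}
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(ctx); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // budget exhausted, do not sleep again
		}
		sleep := time.Duration(rand.Int63n(int64(delay) + 1)) // full jitter
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// runDemo retries an operation that fails a fixed number of times first.
func runDemo(failures, attempts int) (calls int, err error) {
	op := func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("simulated 503")
		}
		return nil
	}
	err = retry(context.Background(), attempts, time.Millisecond, 5*time.Millisecond, op)
	return calls, err
}

func main() {
	fmt.Println(runDemo(2, 5)) // succeeds on the third call
	fmt.Println(runDemo(5, 3)) // runs out of attempts
}
```

Because each Inference call opens a fresh session, the operation passed to a helper like this should cover the whole call, including reading the stream, so a retry never resumes a half-consumed response.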
API reference
Constructor
baseURL is the Cocoon WebSocket endpoint, typically wss://.... ws:// is accepted for local development. The HTTP-shim endpoints (/v1/models, /v1/cocoon/workers) are derived by replacing the scheme with https:// (for wss://) or http:// (for ws://).
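The scheme swap can be sketched with `net/url`. The `httpURLFor` helper and the example hostnames are hypothetical, and whether the SDK replaces the base URL's path outright, as done here, or joins onto it is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// httpURLFor maps a WebSocket base URL to the HTTP URL for a shim path:
// wss becomes https, ws becomes http, and the path is replaced.
func httpURLFor(wsBase, path string) (string, error) {
	u, err := url.Parse(wsBase)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "wss":
		u.Scheme = "https"
	case "ws":
		u.Scheme = "http"
	default:
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	u.Path = path
	u.RawQuery = ""
	return u.String(), nil
}

func main() {
	fmt.Println(httpURLFor("wss://gateway.example.test", "/v1/models"))
	fmt.Println(httpURLFor("ws://localhost:8080", "/v1/cocoon/workers"))
}
```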
Client options
Option | Description |
|---|---|
| Bearer token used on every request. |
| Custom HTTP client for |
| Override the path used by |
| Override the WebSocket inference path (default |
| Override the path used by |
| Custom TDX verification policy. Pass |
| Default |
| Default speed tier resolved client-side from |
| Default per-request timeout in seconds passed to the TEE. |
Client methods
Method | Returns | Description |
|---|---|---|
|
| Fetch available models from the OpenAI-shim |
|
| Fetch live worker types from |
|
| Open a TEE-encrypted streaming session. |
InferenceRequest
MaxCoefficient, SpeedTier, and Timeout carry json:"-" tags; the SDK builds the encrypted shroud.* wire object inside Inference and these fields never appear in the marshalled request body.
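The effect of the `json:"-"` tag is easy to confirm with `encoding/json`. The struct below is a sketch rather than the SDK's InferenceRequest: the `Model` and `Messages` fields, their JSON names, and the field types are assumptions; only the three tagged field names come from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message and requestSketch are hypothetical shapes for illustration.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type requestSketch struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`

	// Fields tagged json:"-" are skipped by encoding/json entirely.
	MaxCoefficient int    `json:"-"`
	SpeedTier      string `json:"-"`
	Timeout        int    `json:"-"`
}

// marshalSketch sets every field and returns the resulting JSON.
func marshalSketch() (string, error) {
	req := requestSketch{
		Model:          "m",
		Messages:       []message{{Role: "user", Content: "hi"}},
		MaxCoefficient: 200,
		SpeedTier:      "priority",
		Timeout:        30,
	}
	b, err := json.Marshal(req)
	return string(b), err
}

func main() {
	out, err := marshalSketch()
	fmt.Println(out, err)
}
```

The marshalled body contains only `model` and `messages`, even though all three tagged fields were set.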
Stream
Method | Description |
|---|---|
| Read next chunk; returns |
| TEE-signed token usage after stream completes. |
| Disclosure fields negotiated with the TEE. |
| Error that stopped the stream, |
| Close the WebSocket connection. |
Usage
Model
OwnedBy carries the network name (cocoon-classic, cocoon-alpha), not a hard-coded cocoon literal. Worker counts and coefficient ranges are exposed through ListWorkers, not on Model itself.