Connection URL
Authentication
The audio WebSocket requires the computer’s password as the token query parameter. Retrieve it from the Get VNC Password endpoint before connecting.
Connections without a valid token are rejected with close code 4401.
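As a rough sketch of how a client might attach the token, the snippet below appends it to a connection URL as a query string. The base URL `wss://example.invalid/audio` is a placeholder, since the real endpoint is not shown in this section, and the helper names are illustrative. Only the `token` parameter and close code 4401 come from the text above.

```python
from urllib.parse import urlencode

# Close code the server uses for a missing or invalid token (from the text above).
AUTH_REJECTED_CLOSE_CODE = 4401


def build_audio_url(base_url: str, token: str) -> str:
    """Append the computer password as the `token` query parameter."""
    # urlencode percent-escapes characters that are not URL-safe.
    return f"{base_url}?{urlencode({'token': token})}"


def is_auth_rejection(close_code: int) -> bool:
    """True when the socket was closed because the token was rejected."""
    return close_code == AUTH_REJECTED_CLOSE_CODE


# Placeholder endpoint and password: substitute the real values.
url = build_audio_url("wss://example.invalid/audio", "s3cr3t/pw")
```

Escaping the password matters because it may contain characters such as `/` or `&` that would otherwise change the meaning of the query string.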
Query Parameters
Computer password. Retrieve via the Get VNC Password endpoint.
Audio sample rate in Hz.
Number of audio channels.
1 for mono, 2 for stereo.
Audio Format
| Property | Value |
|---|---|
| Encoding | s16le (signed 16-bit little-endian PCM) |
| Sample rate | 24,000 Hz |
| Channels | 1 (mono) |
| Frame duration | 20 ms |
| Frame size | 960 bytes |
| Bitrate | ~48 KB/s |
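The frame size and bitrate in the table follow directly from the sample rate, channel count, and 16-bit sample width. The sketch below reproduces that arithmetic; the function names are illustrative, and applying the same 20 ms frame duration to non-default settings is an assumption, not something this section states.

```python
BYTES_PER_SAMPLE = 2  # s16le stores each sample in 16 bits


def bytes_per_second(sample_rate: int, channels: int) -> int:
    """Raw PCM data rate in bytes per second."""
    return sample_rate * channels * BYTES_PER_SAMPLE


def frame_size_bytes(sample_rate: int, channels: int, frame_ms: int = 20) -> int:
    """Size of one frame, assuming a fixed frame duration in milliseconds."""
    samples_per_channel = sample_rate * frame_ms // 1000
    return samples_per_channel * channels * BYTES_PER_SAMPLE


default_frame = frame_size_bytes(24_000, 1)  # 480 samples x 2 bytes
default_rate = bytes_per_second(24_000, 1)   # 24,000 samples/s x 2 bytes
```

With the defaults this gives 960 bytes per frame and 48,000 bytes per second, matching the table.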
Protocol
Connection Flow
- Client connects with the ?token= parameter
- Server validates the token
- Server sends a JSON text frame confirming the audio configuration:
- Server continuously sends binary frames containing raw PCM audio data
- Client decodes and plays via Web Audio API or any PCM-capable player
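The flow above mixes one kind of text frame (JSON) with a stream of binary frames (PCM), so a client needs to tell them apart. The sketch below shows one way to do that, assuming a WebSocket client library that delivers text frames as `str` and binary frames as `bytes`. The function name is illustrative, and the JSON payload in the usage lines is invented for the example, since the actual fields of the confirmation message are not shown in this section.

```python
import json
from typing import Any, Tuple, Union


def classify_frame(frame: Union[str, bytes]) -> Tuple[str, Any]:
    """Return ("json", parsed) for text frames and ("pcm", raw) for binary frames."""
    if isinstance(frame, str):
        # Text frames carry JSON control messages.
        return ("json", json.loads(frame))
    # Binary frames carry raw PCM audio and are passed through untouched.
    return ("pcm", bytes(frame))


# Hypothetical frames, for illustration only.
kind_a, body_a = classify_frame('{"example_field": 1}')
kind_b, body_b = classify_frame(bytes(960))
```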
Client → Server Messages
stop
Stop audio capture and close the connection.
ping
Send a heartbeat ping to keep the connection alive.
Server → Client Messages
started
Binary frames
Raw PCM audio data. Each binary frame contains signed 16-bit little-endian samples. At the default 24 kHz mono, each 20 ms frame is 960 bytes (480 samples × 2 bytes per sample).
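Most playback APIs, including the Web Audio API, expect floating-point samples in the range -1.0 to 1.0, not 16-bit integers. A minimal sketch of that conversion using only the standard library follows; the function name is illustrative.

```python
import struct
from typing import List


def decode_s16le(frame: bytes) -> List[float]:
    """Convert s16le PCM bytes to floats in the range [-1.0, 1.0)."""
    if len(frame) % 2:
        raise ValueError("s16le frame must contain an even number of bytes")
    count = len(frame) // 2
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    samples = struct.unpack(f"<{count}h", frame)
    return [s / 32768.0 for s in samples]


# Three samples: zero, the maximum (32767), and the minimum (-32768).
decoded = decode_s16le(b"\x00\x00\xff\x7f\x00\x80")
```

Dividing by 32768 maps the most negative sample to exactly -1.0 and keeps the most positive one just below 1.0.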
pong
Response to a ping message.
Examples
Playback Tips
Browser Autoplay
Browsers require a user gesture (click, keypress) before AudioContext can play. Create the context inside a click handler, or call ctx.resume() after a user interaction.
Drift Correction
Schedule each buffer slightly ahead of the current playback time, and reset the schedule when drift exceeds ~150 ms. This prevents gaps and keeps latency below 50 ms.
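One way to read this tip is as a small scheduler that tracks when the next buffer should start and resets when that time falls behind the clock or runs too far ahead of it. The sketch below models only the timing logic. The class name, the 20 ms lead, and the interpretation of "drift" as the distance between the scheduled start and the current time are assumptions. In a browser, `now` would correspond to `AudioContext.currentTime` and the returned value to the start time passed to `AudioBufferSourceNode.start()`.

```python
class PlaybackScheduler:
    """Pick start times for consecutive audio buffers (timing logic only)."""

    def __init__(self, lead_s: float = 0.02, max_drift_s: float = 0.15):
        self.lead_s = lead_s            # assumed scheduling lead
        self.max_drift_s = max_drift_s  # ~150 ms reset threshold from the text
        self.next_start = 0.0

    def schedule(self, now: float, duration_s: float) -> float:
        """Return the start time for a buffer lasting `duration_s` seconds."""
        ahead = self.next_start - now
        # Reset after an underrun (schedule is in the past) or when the
        # schedule has drifted too far ahead of the clock.
        if ahead < 0 or ahead > self.max_drift_s:
            self.next_start = now + self.lead_s
        start = self.next_start
        self.next_start = start + duration_s
        return start


scheduler = PlaybackScheduler()
first = scheduler.schedule(now=1.0, duration_s=0.02)    # reset: schedule was in the past
second = scheduler.schedule(now=1.01, duration_s=0.02)  # back-to-back with the first
```

Scheduling each buffer at the end of the previous one avoids audible gaps, while the reset bounds how much latency can build up if frames arrive faster than they play.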
Heartbeat
Send periodic ping messages (every 30 seconds) to keep the connection alive and detect disconnections early.
Sample Rate
The default 24 kHz mono is optimized for voice and system sounds. Pass sample_rate=48000 for higher fidelity if needed.
Audio streaming requires the computer to be running. If the computer is stopped, the WebSocket connection will be rejected. Audio is captured from the VM’s virtual PulseAudio speaker, so any sound the desktop produces (browser media, system alerts, application audio) is streamed.