Overview

The listen verb streams a call's audio in real time over a websocket connection to a third-party websocket server. The stream may be one-way or bidirectional.

Example

{
  "verb": "listen",
  "url": "wss://myrecorder.example.com/calls",
  "mixType": "stereo"
}

Parameters

  • actionHook (string, required): Webhook to invoke when the listen operation ends. The information will include the duration of the audio stream, and a 'digits' property if the recording was terminated by a DTMF key.
  • bidirectionalAudio.enabled (boolean, default: true): If true, enable bidirectional audio.
  • bidirectionalAudio.sampleRate (number): The sample rate of PCM audio sent back to Graine over the websocket.
  • bidirectionalAudio.streaming (boolean, default: false): If true, enable streaming of audio from your application back to Graine (and the remote caller).
  • disableBidirectionalAudio (boolean): If true, disable bidirectional audio. Deprecated; use bidirectionalAudio.enabled: false instead.
  • finishOnKey (string): The set of digits that can end the listen action if any one of them is detected.
  • maxLength (number): The maximum length of the listened audio stream, in seconds. The websocket connection is closed when this duration is reached.
  • metadata (object): Additional user data to add to the JSON payload sent to the remote server when the websocket connection is first established.
  • mixType (string, default: "mono"): "mono" (send a single channel), "stereo" (send dual channels of both calls in a bridge), or "mixed" (send audio from both calls in a bridge as a single mixed audio stream).
  • passDtmf (boolean, default: false): If true, any DTMF digits detected from the caller are passed over the websocket as text frames in JSON format.
  • playBeep (boolean, default: false): Whether to play a beep at the start of the listen operation.
  • sampleRate (number, default: 8000): Sample rate of the PCM audio sent from Graine to the remote server. Allowable values: 8000, 16000, 24000, 48000, or 64000.
  • timeout (number): The number of seconds of silence that terminates the listen operation.
  • transcribe (object): A nested transcribe verb.
  • url (string, required): The URL of the remote server to connect to; must be a ws or wss URL.
  • wsAuth.password (string): HTTP basic auth password to use on the websocket connection, if desired.
  • wsAuth.username (string): HTTP basic auth username to use on the websocket connection, if desired.

Audio format

Audio is sent over the websocket in linear 16-bit PCM encoding, using the sample rate specified in the sampleRate property. The audio is sent in binary frames over the websocket connection. The audio sent back from the server is expected to also be linear16 PCM encoded audio, with a sample rate specified in the bidirectionalAudio.sampleRate property. If the bidirectionalAudio.streaming property is set to true, then the audio sent back from the server should be sent as binary frames over the websocket connection and will be streamed to the caller. Otherwise, audio that is sent back is expected to be sent as JSON text frames containing base64-encoded audio content that will be buffered and then played out to the caller once it is received in full.
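As a sketch of what the binary frames contain, the snippet below packs audio samples as linear 16-bit PCM and splits them into fixed-size frames. The little-endian byte order and the helper name are assumptions for illustration; only the linear16 encoding itself comes from the text above.

```python
import math
import struct

def pcm16_frames(samples, frame_size=320):
    """Pack float samples in [-1.0, 1.0] as signed 16-bit PCM
    (little-endian assumed) and split into fixed-size binary frames."""
    raw = struct.pack("<%dh" % len(samples),
                      *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    return [raw[i:i + frame_size] for i in range(0, len(raw), frame_size)]

# One second of a 440 Hz tone at an 8000 Hz sample rate:
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
frames = pcm16_frames(tone)   # 8000 samples -> 16000 bytes -> 50 frames of 320 bytes
```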

Initial metadata

One text frame is sent immediately after the websocket connection is established. This text frame contains a JSON string with all of the call attributes normally sent on an HTTP request (e.g. callSid, etc), plus sampleRate and mixType properties describing the audio sample rate and stream(s). Additional metadata can also be added to this payload using the metadata property. Once the initial text frame containing the metadata has been sent, the remote side should expect to receive only binary frames, containing audio.
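A receiving server can parse that initial text frame before switching to binary-frame handling. In this sketch, callSid, sampleRate, mixType, and metadata come from the description above; the sample callSid value and the accountId key are hypothetical.

```python
import json

def parse_initial_frame(text_frame):
    """Parse the first text frame sent after the websocket connection is
    established; returns the audio parameters plus the full payload."""
    meta = json.loads(text_frame)
    return meta["sampleRate"], meta["mixType"], meta

# A hypothetical initial frame:
frame = json.dumps({
    "callSid": "some-call-sid",          # hypothetical value
    "sampleRate": 8000,
    "mixType": "stereo",
    "metadata": {"accountId": "abc"},    # whatever you set in the metadata property
})
rate, mix, meta = parse_initial_frame(frame)
```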

Passing DTMF

Any DTMF digits entered by the far end party on the call can optionally be passed to the websocket server as JSON text frames by setting the passDtmf property to true. Each DTMF entry is reported separately in a payload that contains the specific DTMF key that was entered, as well as the duration, which is reported in RTP timestamp units. The payload that is sent will look like this:
{
  "event": "dtmf",
  "dtmf": "2",
  "duration": "1600"
}

Bidirectional audio

Audio can also be sent back over the websocket to Graine. This audio, if supplied, will be played out to the caller.
Bidirectional audio is not supported when the listen is nested in the context of a dial verb.
There are two separate modes for bidirectional audio:
  • non-streaming: where you provide a full base64-encoded audio file as JSON text frames
  • streaming: where you stream audio as L16 PCM raw audio as binary frames

Non-streaming

The far-end websocket server supplies bidirectional audio by sending a JSON text frame over the websocket connection:
{
  "type": "playAudio",
  "data": {
    "audioContent": "base64-encoded content..",
    "audioContentType": "raw",
    "sampleRate": "16000"
  }
}
In the example above, raw (headerless) audio is sent. The audio must be 16-bit PCM encoded, with a sample rate of 8000, 16000, 24000, 32000, 48000, or 64000 Hz. Alternatively, a WAV file can be supplied by setting audioContentType to "wav" (or "wave"), in which case no sampleRate property is needed. In all cases, the audio must be base64 encoded when sent over the socket.
If multiple playAudio commands are sent before the first has finished playing, they are queued and played in order. You may have up to 10 queued playAudio commands at any time. Once a playAudio command has finished playing out the audio, a playDone JSON text frame will be sent over the websocket connection:
{
  "type": "playDone"
}
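Building a playAudio command from raw PCM bytes is mostly a matter of base64 encoding; a minimal sketch (the helper name is ours, the frame shape is from the example above):

```python
import base64
import json

def play_audio_command(pcm_bytes, sample_rate=16000):
    """Build a playAudio JSON text frame carrying raw (headerless)
    16-bit PCM audio, base64 encoded as the protocol requires."""
    return json.dumps({
        "type": "playAudio",
        "data": {
            "audioContent": base64.b64encode(pcm_bytes).decode("ascii"),
            "audioContentType": "raw",
            "sampleRate": str(sample_rate),
        },
    })

cmd = json.loads(play_audio_command(b"\x00\x01" * 160))
```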
A killAudio command can also be sent by the websocket server to stop the playout of audio that was started via a previous playAudio command:
{
  "type": "killAudio"
}
To end the listen, the websocket can send a disconnect command:
{
  "type": "disconnect"
}

Streaming

To enable streaming bidirectional audio, you must explicitly enable it in the listen verb with the streaming property:
{
  "verb": "listen",
  "bidirectionalAudio": {
    "enabled": true,
    "streaming": true,
    "sampleRate": 8000
  }
}
Your application should then send binary frames of linear16 PCM raw data at the specified sample rate over the websocket connection. You can specify both the sample rate you want to receive over the websocket and the sample rate at which you send audio back, and they do not need to be the same; in the example below, we receive audio at 8 kHz but send it back at 16 kHz. You can send frames of any length, and Graine will buffer the received audio to play it out at the correct sample rate; however, we recommend sending fixed-length messages (320 bytes at 8 kHz, 640 bytes at 16 kHz). Each 16-bit sample takes 2 bytes, so your frames should always contain an even number of bytes to ensure the best playback quality.
{
  "verb": "listen",
  "sampleRate": 8000,
  "bidirectionalAudio": {
    "enabled": true,
    "streaming": true,
    "sampleRate": 16000
  }
}
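The recommended frame sizes follow directly from the sample rate and the 2 bytes per sample: 320 bytes at 8 kHz and 640 bytes at 16 kHz each correspond to 20 ms of mono audio (the 20 ms framing is our inference; the text only gives the byte counts).

```python
def frame_bytes(sample_rate_hz, frame_ms=20):
    """Bytes per frame of mono 16-bit PCM audio: samples per frame
    times 2 bytes per sample."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * 2

frame_bytes(8000)    # 320 bytes, the recommended size at 8 kHz
frame_bytes(16000)   # 640 bytes, the recommended size at 16 kHz
```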

Commands

You can send the following commands over the websocket as JSON frames:
  • disconnect — Close the websocket from the Graine side and end the listen verb
  • killAudio — Stop any playing/buffered audio from bidirectional socket
  • mark — Synchronize with playout (see below)
  • clearMarks — Clear tracked marks

disconnect

{
  "type": "disconnect"
}
This causes the websocket to be closed from the Graine side, and the associated listen verb to end.

killAudio

{
  "type": "killAudio"
}
This causes any audio that is playing out from the bidirectional socket as well as any buffered audio to be flushed.

mark

{
  "type": "mark",
  "data": {
    "name": "my-mark-1"
  }
}
You can send a mark command if you want to synchronize activities on your end with the playout of the audio stream that you have provided. When that point in the audio stream is later reached during playback, you will get a matching JSON frame back over the websocket:
{
  "type": "mark",
  "data": {
    "name": "my-mark-1",
    "event": "playout"
  }
}
The event will contain either playout or cleared depending on whether the audio stream reached the mark during playout or the mark was never played out due to a killAudio command.
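One way to use marks is to keep a set of pending names and resolve each when its playout or cleared event arrives. This is a sketch of client-side bookkeeping, not anything mandated by the protocol:

```python
import json

class MarkTracker:
    """Track marks sent alongside streamed audio and resolve them when
    the matching mark event frame comes back over the websocket."""

    def __init__(self):
        self.pending = []

    def mark_command(self, name):
        """Build a mark command frame and remember the name as pending."""
        self.pending.append(name)
        return json.dumps({"type": "mark", "data": {"name": name}})

    def on_text_frame(self, raw):
        """Handle an incoming text frame; returns (name, event) for
        mark events ('playout' or 'cleared'), else None."""
        msg = json.loads(raw)
        if msg.get("type") == "mark":
            name = msg["data"]["name"]
            if name in self.pending:
                self.pending.remove(name)
            return name, msg["data"]["event"]
        return None

tracker = MarkTracker()
tracker.mark_command("my-mark-1")
result = tracker.on_text_frame(
    '{"type": "mark", "data": {"name": "my-mark-1", "event": "playout"}}')
```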

clearMarks

{
  "type": "clearMarks"
}
This command clears (removes) any audio marks that are being tracked. When you remove the marks in this way, you will not receive mark events for the removed marks.