# Streaming Transcription
Attendi provides a streaming transcription endpoint that can be used to transcribe audio in real time. This is useful for giving the user immediate feedback and reducing the time they have to wait for the transcription to complete. The client communicates with the streaming endpoint over a WebSocket connection.
The transcription endpoint communicates with the client using a JSON-based protocol. The client sends audio data to the server in chunks, and the server sends back a list of actions such as `replace_text`, `add_annotation`, and more. The client can then use these actions to update the state of the transcription.
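As a rough illustration, a server message carries a list of such actions. The payload shape below is hypothetical and shown only to convey the idea; the exact wire format may differ, and the `AttendiTranscribeStream` abstraction described below handles the real protocol for you:

```typescript
// Hypothetical illustration of an incoming server message. The field names
// inside each action are assumptions, not the actual wire format.
const message = {
  actions: [
    // Replace a range of the current text with new text.
    {
      type: "replace_text",
      startCharacterIndex: 15,
      endCharacterIndex: 23,
      text: "John Doe",
    },
    // Mark a range of the text with a property, e.g. as tentative.
    {
      type: "add_annotation",
      startCharacterIndex: 15,
      endCharacterIndex: 23,
      annotationType: "transcription_tentative",
    },
  ],
};
```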
## Streaming state
The most important concept is the current stream state. This is an object that contains the current transcription text and any annotations that have been added to the transcription. The client can map this state to a visual representation such as a text editor. The state is given by the following interface:
```typescript
interface AttendiStreamState {
  /**
   * The current text of the transcription.
   */
  text: string;
  /**
   * Can be used to mark specific parts of the transcription as having some
   * specific property, such as being a name or having a high transcription certainty.
   */
  annotations: Annotation[];
}
```
An `Annotation` allows a specific part of the text to be marked with a specific property. For example, if a part of the transcript is still uncertain and subject to change, it might provide a better user experience to display it in a different color.
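The `Annotation` type itself is not spelled out in this section; inferred from the example below, it has roughly the following shape (a sketch, not the authoritative SDK definition, which may carry additional fields):

```typescript
// Sketch of the Annotation shape, inferred from the example below; the actual
// SDK type may differ or include more fields.
interface Annotation {
  /** Index of the first character the annotation applies to. */
  startCharacterIndex: number;
  /** Index of the character just past the end of the annotated range. */
  endCharacterIndex: number;
  /** The property being marked, e.g. "transcription_tentative". */
  type: string;
}
```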
For example, we could have the following state:
```typescript
const state: AttendiStreamState = {
  text: "Hi, my name is John Doe.",
  annotations: [
    {
      startCharacterIndex: 15,
      endCharacterIndex: 23,
      type: "transcription_tentative",
    },
  ],
};
```
This would allow you to display the text "John Doe" in a different color, indicating that it is still tentative. While handling annotations is not strictly necessary, doing so can provide a better user experience.
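As a minimal sketch of how such a state could be rendered (not part of the SDK; `renderAnnotatedText` is a hypothetical helper that assumes non-overlapping annotations sorted by start index):

```typescript
// Hypothetical helper: wrap each annotated range in a <span> so it can be
// styled, e.g. coloring "transcription_tentative" text differently via CSS.
function renderAnnotatedText(text: string, annotations: Annotation[]): string {
  let html = "";
  let cursor = 0;
  for (const annotation of annotations) {
    // Append the unannotated text before this annotation.
    html += escapeHtml(text.slice(cursor, annotation.startCharacterIndex));
    // Wrap the annotated range in a span whose class is the annotation type.
    const annotated = text.slice(
      annotation.startCharacterIndex,
      annotation.endCharacterIndex,
    );
    html += `<span class="${annotation.type}">${escapeHtml(annotated)}</span>`;
    cursor = annotation.endCharacterIndex;
  }
  // Append any remaining text after the last annotation.
  html += escapeHtml(text.slice(cursor));
  return html;
}

function escapeHtml(value: string): string {
  const div = document.createElement("div");
  div.textContent = value;
  return div.innerHTML;
}
```

With the state above, the "John Doe" range would be wrapped in `<span class="transcription_tentative">`, which could then be given a different color in CSS.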
## `AttendiTranscribeStream`
The client SDK provides an abstraction, `AttendiTranscribeStream`, that knows how to update the state based on the actions received from the server. This means that you don't have to worry about the low-level details of the protocol, i.e. how to update the text and annotations based on the transcription server's messages, and can instead focus on how to display the transcription to the user. It can be used as follows:
```javascript
// Create a new stream.
this.stream = createStream();

// A message from the server comes in. Update the stream state. Note that a
// *new* stream is returned, so you have to update the reference to the stream.
this.stream = this.stream.receiveActions(message.actions);

// Get the current state of the stream.
const { text, annotations } = this.stream.state;
console.log(text); // "Hi, my name is John Doe."
console.log(annotations); // [{ startCharacterIndex: 15, endCharacterIndex: 23, type: "transcription_tentative" }]

// Update the UI based on the new state.
textField.value = text;
```
## `AttendiStreamingTranscribePlugin`
The `AttendiStreamingTranscribePlugin` is an abstraction over the WebSocket connection and the `AttendiTranscribeStream`. It handles setting up the WebSocket connection, sending audio data to the server when recording is started, and updating the stream state based on the server's messages.
The plugin exposes the stream state to the client through the `onStreamUpdated`, `onStreamCompleted`, and `onAbnormalClosure` callbacks. The `onStreamUpdated` callback is called whenever the stream is updated. When recording is stopped, either through user interaction or programmatically, the final result of the transcription is exposed through the `onStreamCompleted` callback, again as an `AttendiTranscribeStream` object. This object contains the final text and annotations; you could use it to persist the transcription to a text field, for example.
The `onAbnormalClosure` callback is called when the connection closes abnormally. This can be used to handle the situation where the connection closes unexpectedly, for example to show an error message to the user and persist the transcription up to that point, so no data is lost. This callback also receives the `AttendiTranscribeStream` object.
Example:
```javascript
const mic = document.querySelector("attendi-microphone");

const attendiTranscribePluginConfig = {
  apiURL: "https://api.attendi.nl", // Doesn't have to be specified usually.
  customerKey: "<customerKey>",
  userId: "userId",
  config: {
    model: "DistrictCare", // Doesn't have to be specified usually.
  },
  metadata: {
    microphoneLocation: "report-page-big-textfield",
    userAgent: navigator.userAgent, // Doesn't have to be specified usually.
  },
  unitId: "unitId",
};

function onStreamUpdated(updatedStream) {
  const streamingState = updatedStream.state;
  const { text, annotations } = streamingState;
  // Build the streaming content for the editor using the text and annotations.
  const streamingContent = buildStreamContent(text, annotations);
  // Add the streaming content to the editor.
  addToEditor(streamingContent);
}

function onStreamCompleted(completedStream) {
  const streamingState = completedStream.state;
  const { text, annotations } = streamingState;
  const completedStreamingContent = buildStreamContent(text, annotations);
  // Remove the existing streaming content node, since we are done with this
  // stream.
  removeExistingStreamingContent();
  // Insert the completed streaming content at the end of the editor.
  insertCompletedStreamingContent(completedStreamingContent);
}

function onAbnormalClosure(stream) {
  // The callback receives an `AttendiTranscribeStream`; read the text and
  // annotations from its state.
  const { text, annotations } = stream.state;
  // Build the streaming content for the editor using the text and annotations.
  const streamingContent = buildStreamContent(text, annotations);
  // Remove the existing streaming content node, since we are done with this
  // stream.
  removeExistingStreamingContent();
  // Insert the streaming content at the end of the editor, so the
  // transcription up to this point is not lost.
  insertCompletedStreamingContent(streamingContent);
  // Show an error message to the user.
  showPopover("Connection closed unexpectedly. Please try again.");
}

mic.plugins.add(
  "attendi-streaming-transcribe",
  new AttendiStreamingTranscribePlugin({
    transcribeConfig: attendiTranscribePluginConfig,
    onStreamUpdated,
    onStreamCompleted,
    onAbnormalClosure,
  }),
);
```
See this package's `example_index.html` for a more detailed example, using the tiptap editor.
By default, the plugin fetches an authentication token using the Attendi `/identity/authenticate` endpoint, which is used to authenticate with the transcription endpoint. A WebSocket connection is then opened with Attendi's streaming transcription endpoint. This behavior can be changed by passing a `getAuthenticationToken` function in the constructor.
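For example, to fetch the token from your own backend instead (a sketch; the backend URL is hypothetical, and the assumed signature of `getAuthenticationToken` is an async function returning the token string):

```typescript
// Sketch: supply a custom token fetcher instead of the default behavior of
// calling Attendi's /identity/authenticate endpoint. The URL below is
// hypothetical, and the callback signature is an assumption.
const plugin = new AttendiStreamingTranscribePlugin({
  transcribeConfig: attendiTranscribePluginConfig,
  onStreamUpdated,
  onStreamCompleted,
  onAbnormalClosure,
  getAuthenticationToken: async () => {
    const response = await fetch("https://your-backend.example.com/attendi-token");
    const { token } = await response.json();
    return token;
  },
});
```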
This plugin connects to Attendi's transcription API directly in the front end. If this doesn't suit your use case, the plugin can serve as an example for implementing a similar one. Under the hood, it uses a `WebSocketAudioStreamingAdapter` that abstracts away the low-level details of the WebSocket connection, which can make implementing a similar plugin easier. See its documentation for more information.