Cloud real-time transcription considerations

The considerations below apply to RTAA deployments in which real-time transcription is performed in the cloud using the Real-Time Linguistic (remote) transcription service.

The Real-Time Linguistic (remote) engine provides real-time transcription in the cloud. Performing real-time transcription remotely allows for higher transcription accuracy and frees up Recorder resources.

When you use the Real-Time Linguistic (remote) engine, interactions and transcriptions that may contain sensitive data are streamed outside the enterprise. The service implements security measures to protect all data in transit.

Additionally, several bandwidth and network quality considerations must be taken into account to ensure consistent quality of service.

Security

The Real-Time Linguistic (remote) engine implements the following security measures:

The service authenticates the client application as having valid Azure credentials for daemon flow (service-to-service communication), and verifies that the application has permissions to consume the service.
The service endpoint uses the TLS 1.2 protocol, enforcing encrypted communication over a secure web socket. Because a web socket is a bi-directional channel, both inbound and outbound packets (audio and text) are secured during transport. Packet contents are not separately encrypted beyond the channel encryption.
All communication between internal cloud services uses TLS 1.2.
The service does not store data permanently. During internal routing between service components, some short-term caching of single data packets occurs for up to 1 second. The cache is encrypted using a strong key managed in the NCI vault.
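As an illustration of the TLS 1.2 requirement above, a client could refuse to negotiate anything older than TLS 1.2 before opening the secure web socket. The following is a minimal sketch using Python's standard ssl module; the function name make_tls12_context is hypothetical, and the actual client configuration required by the service is not specified in this document.

```python
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Build a client-side TLS context that rejects protocols older than TLS 1.2.

    Hypothetical helper for illustration; not part of the product.
    """
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Refuse TLS 1.0 and 1.1 during the handshake.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context built this way can be passed to any web socket client library that accepts an ssl.SSLContext for wss:// connections.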

Each interaction is split into audio packets of 0.5 seconds each and text packets of a few words each. The packets are multiplexed into a single regional pipeline that serves all customers in that region. All packets of a single interaction are guaranteed to reach the same transcription engine instance, and the transcribed text packets of that interaction are guaranteed to return to the originating web socket. If the connection to the service is lost, transcription of that interaction is abandoned from that point forward to eliminate the chance of routing errors.
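The 0.5-second audio packets described above can be sketched as follows. This is an illustration of the packet sizes only, not the Recorder's actual implementation. It assumes G.711 audio at 8 KB per channel per second (the figure given in the Bandwidth section), and the name split_into_packets is hypothetical.

```python
from typing import Iterator

BYTES_PER_SECOND = 8000   # G.711, one channel (assumed from the Bandwidth section)
PACKET_SECONDS = 0.5      # packet duration stated above
PACKET_BYTES = int(BYTES_PER_SECOND * PACKET_SECONDS)  # 4000 bytes of audio


def split_into_packets(channel_audio: bytes) -> Iterator[bytes]:
    """Yield consecutive 0.5-second slices of one channel's audio.

    The final slice may be shorter if the audio does not end on a
    packet boundary. Headers are not added here.
    """
    for start in range(0, len(channel_audio), PACKET_BYTES):
        yield channel_audio[start:start + PACKET_BYTES]
```

For example, 3 seconds of single-channel audio (24,000 bytes) yields 6 packets of 4,000 bytes each.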

The logs do not contain any sensitive data, such as transcriptions.

The use of a VPN for secure access is optional.

Bandwidth

Transcription accuracy is highly dependent on audio quality. Higher audio quality usually means more audio data per second, and therefore more bandwidth.

The Real-Time Linguistic (remote) engine can transcribe audio streamed in either PCM or G.711 format. The Recorder streams audio in G.711 format, which uses half the bandwidth of PCM.

The total bandwidth required to stream the audio is the number of concurrent interactions multiplied by 19.5 KB/sec per interaction. For example, 500 concurrent interactions require up to 10 MB/sec (500 × 19.5 KB/sec = 9.75 MB/sec).

The calculation is based on the following assumptions:

G.711 uses 8 KB/channel/second.
Audio is streamed from the Recorder in packets of approximately 0.5 seconds.
Returned text packets are approximately 1 KB every 2 seconds.
The header size per packet is 0.5 KB.
During a stereo interaction, 4 audio packets (2 per channel) and 2 half text packets (half a text packet per channel) are transmitted every second.
The packet sizes are calculated as follows:
- Audio packet size: 8000 / 2 + 500 = 4500 bytes/0.5 sec/channel
- Text packet size: 1000 + 500 = 1500 bytes/2 sec/channel
- Total per stereo interaction: 4 × 4500 + 2 × (1500 / 2) = 19,500 bytes/sec

PCM format consumes twice the bandwidth of G.711, approximately 39 KB/sec/interaction.
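The bandwidth calculation above can be reproduced with a short script, which may help when sizing for a different number of concurrent interactions. This is a sketch based only on the assumptions listed above (with 1 KB taken as 1,000 bytes, as in the packet size figures); all constant and function names are illustrative.

```python
# Assumptions taken from the list above; names are illustrative only.
G711_BYTES_PER_CHANNEL_PER_SEC = 8000
AUDIO_PACKET_SECONDS = 0.5
TEXT_PACKET_BYTES = 1000
TEXT_PACKET_INTERVAL_SECONDS = 2
HEADER_BYTES = 500
CHANNELS = 2  # stereo interaction


def bytes_per_second_per_interaction(channels: int = CHANNELS) -> float:
    """Bandwidth of one G.711 interaction, including packet headers."""
    audio_packet = G711_BYTES_PER_CHANNEL_PER_SEC * AUDIO_PACKET_SECONDS + HEADER_BYTES
    text_packet = TEXT_PACKET_BYTES + HEADER_BYTES
    audio_per_channel = audio_packet / AUDIO_PACKET_SECONDS        # 9000 bytes/sec
    text_per_channel = text_packet / TEXT_PACKET_INTERVAL_SECONDS  # 750 bytes/sec
    return channels * (audio_per_channel + text_per_channel)


def total_bandwidth_bytes_per_second(concurrent_interactions: int) -> float:
    """Total bandwidth for a given number of concurrent stereo interactions."""
    return concurrent_interactions * bytes_per_second_per_interaction()


print(bytes_per_second_per_interaction())     # 19500.0
print(total_bandwidth_bytes_per_second(500))  # 9750000.0
```

The result for 500 interactions, 9,750,000 bytes/sec, matches the figure of up to 10 MB/sec given earlier.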

Network quality

Network quality has a significant impact on quality of service. Poor network quality increases latency and causes secure web socket (WSS) disconnections.

Consider the following network quality factors:

The physical distance between the Recorder and the Real-Time Linguistic (remote) engine is an important factor. The Real-Time Linguistic (remote) engine should be consumed from the same region where the Recorder is deployed.
The round-trip time (RTT) between a Recorder and the service should not exceed 200 ms.
Poor network quality leads to packet loss, and resending the lost packets increases the RTT.
DX (Direct Connect), while not required, promotes reliability and is the only way to get an SLA from AWS.
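One rough way to check the 200 ms RTT guideline from a Recorder host is to time a TCP connection to the service endpoint, since a TCP handshake takes about one round trip. The following is a minimal sketch under that assumption; the function names are hypothetical, the host to test is whatever endpoint your deployment uses, and the measurement also includes DNS lookup time unless an IP address is passed.

```python
import socket
import time
from typing import Sequence

RTT_LIMIT_MS = 200.0  # guideline stated above


def tcp_connect_time_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Time a TCP handshake to host:port as a rough proxy for RTT.

    Includes DNS resolution time when host is a name rather than an IP.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection is closed immediately; only the handshake is timed
    return (time.perf_counter() - start) * 1000.0


def exceeds_rtt_limit(samples_ms: Sequence[float], limit_ms: float = RTT_LIMIT_MS) -> bool:
    """Return True if any sample is above the limit."""
    return any(sample > limit_ms for sample in samples_ms)
```

Collecting several samples over time and passing them to exceeds_rtt_limit gives a simple indication of whether the link stays within the guideline.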