[Angular] Sample code for transcription with AWS Transcribe Streaming

2 years ago

清, 宇

5 minutes

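Introduction

As the title says, this is a sample that implements real-time transcription in Angular using AWS Transcribe Streaming.

After much trial and error, I looked into and tried a number of things, such as how to switch the microphone on and off, how to create the audio stream, and how to use the AWS SDK. I am writing them down here as an article. I would be glad if it helps anyone facing similar problems.

If you know a better approach, or find mistakes in the explanations or code, I would appreciate your corrections.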

Environment

Verified on Angular v14

Environment                            Version    Notes
Angular CLI                            v14.2.10   $ ng version
Angular                                v14.2.12   same as above
TypeScript                             v4.7.4     same as above
Node.js                                v16.19.0   $ node --version
npm                                    v8.19.3    $ npm --version
@aws-sdk/client-transcribe-streaming   v3.226.0   checked in package.json
microphone-stream                      v6.0.1     same as above
process                                v0.11.10   same as above
buffer                                 v6.0.3     same as above

Angular version details (output of ng version)
$ ng version

     _                      _                 ____ _     ___
    / \   _ __   __ _ _   _| | __ _ _ __     / ___| |   |_ _|
   / △ \ | '_ \ / _` | | | | |/ _` | '__|   | |   | |    | |
  / ___ \| | | | (_| | |_| | | (_| | |      | |___| |___ | |
 /_/   \_\_| |_|\__, |\__,_|_|\__,_|_|       \____|_____|___|
                |___/

Angular CLI: 14.2.10
Node: 16.19.0
Package Manager: npm 8.19.3
OS: darwin x64

Angular: 14.2.12
... animations, common, compiler, compiler-cli, core, forms
... platform-browser, platform-browser-dynamic, router

Package                         Version
---------------------------------------------------------
@angular-devkit/architect       0.1402.10
@angular-devkit/build-angular   14.2.10
@angular-devkit/core            14.2.10
@angular-devkit/schematics      14.2.10
@angular/cdk                    14.2.7
@angular/cli                    14.2.10
@angular/material               14.2.7
@schematics/angular             14.2.10
rxjs                            6.6.7
typescript                      4.7.4

Verified on Angular v13

Environment                            Version    Notes
Angular CLI                            v13.2.4    $ ng --version
Angular                                v13.2.3    same as above
TypeScript                             v4.5.5     same as above
Node.js                                v14.17.0   $ node --version
npm                                    v6.14.13   $ npm --version
@aws-sdk/client-transcribe-streaming   v3.95.0    checked in package.json
microphone-stream                      v6.0.1     same as above
process                                v0.11.10   same as above
buffer                                 v6.0.3     same as above

Angular version details (output of ng version)
$ ng version

     _                      _                 ____ _     ___
    / \   _ __   __ _ _   _| | __ _ _ __     / ___| |   |_ _|
   / △ \ | '_ \ / _` | | | | |/ _` | '__|   | |   | |    | |
  / ___ \| | | | (_| | |_| | | (_| | |      | |___| |___ | |
 /_/   \_\_| |_|\__, |\__,_|_|\__,_|_|       \____|_____|___|
                |___/

Angular CLI: 13.2.4
Node: 14.17.0
Package Manager: npm 6.14.13
OS: darwin x64

Angular: 13.2.3
... animations, common, compiler, compiler-cli, core, forms
... platform-browser, platform-browser-dynamic, router

Package                         Version
---------------------------------------------------------
@angular-devkit/architect       0.1302.4
@angular-devkit/build-angular   13.2.4
@angular-devkit/core            13.2.4
@angular-devkit/schematics      13.2.4
@angular/cli                    13.2.4
@schematics/angular             13.2.4
rxjs                            6.6.0
typescript                      4.5.5

Prerequisites

To perform speech recognition and transcription with AWS Transcribe Streaming, we use the following SDK and libraries. Each of them is available on npm.

@aws-sdk/client-transcribe-streaming

microphone-stream

process

buffer

Preparation

Before implementing the component, carry out the following steps.

Install @aws-sdk/client-transcribe-streaming
(copied from the npm page)
$ npm i @aws-sdk/client-transcribe-streaming

Install microphone-stream
(copied from the npm page)
$ npm i microphone-stream

Install process
(copied from the npm page)
$ npm i process

Install buffer
(copied from the npm page)
$ npm i buffer

Edit polyfills.ts
// Fixes the "global is not defined" error
(window as any).global = window;

// See: https://stackoverflow.com/questions/50313745/angular-6-process-is-not-defined-when-trying-to-serve-application
// See: https://www.npmjs.com/package/process
import * as process from 'process';
(window as any).process = process;

// See: https://github.com/isaacs/core-util-is/issues/27
// See: https://www.npmjs.com/package/buffer
import * as buffer from 'buffer';
(window as any).Buffer = buffer.Buffer;

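Implementation

Template

This is a deliberately simple structure: it lets you start and stop voice recognition and displays the transcribed text.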

<div>
  <div class="title">
    <h2 class="h2-style">{{title}}</h2>
  </div>
  <div>
    <button type="button" class="event-button event-button-w-100" (click)="startVoiceRecognition($event)">Start voice recognition</button>
    <button type="button" class="event-button event-button-w-100" (click)="stopVoiceRecognition($event)">Stop voice recognition</button>
  </div>
  <div class="output-area">
    <textarea class="output-text" readonly placeholder="The text transcribed by AWS Transcribe Streaming is displayed here."></textarea>
  </div>
</div>

Main part

Sample code that performs speech recognition and transcription with the AWS SDK.
See the comments in the code for details on each step.

/* eslint-disable @typescript-eslint/no-non-null-assertion */
/* eslint-disable prefer-arrow/prefer-arrow-functions */
/* eslint-disable @typescript-eslint/naming-convention */
import { Component, ElementRef, OnInit } from '@angular/core';

// Libraries needed for transcription with AWS Transcribe Streaming
// https://github.com/aws/aws-sdk-js-v3/tree/d8475f8d972d28fbc15cd7e23abfe18f9eab0644/clients/client-transcribe-streaming
import {
  TranscribeStreamingClient,
  StartStreamTranscriptionCommand,
  StartStreamTranscriptionCommandInput,
  LanguageCode,
  MediaEncoding,
  StartStreamTranscriptionCommandOutput,
} from '@aws-sdk/client-transcribe-streaming';

// Needed to create the audio data that is streamed to AWS Transcribe Streaming
// https://github.com/microphone-stream/microphone-stream#readme
const MicrophoneStream = require('microphone-stream').default;
let micStream: any = null;

@Component({
  selector: 'app-use-aws-transcribe-streaming',
  templateUrl: './use-aws-transcribe-streaming.component.html',
  styleUrls: ['../../../style/common.css', './use-aws-transcribe-streaming.component.css'],
})
export class UseAwsTranscribeStreamingComponent implements OnInit {
  title = 'Sample using AWS Transcribe Streaming';

  private outputArea: any = null;

  // Setup for using AWS Transcribe Streaming.
  // This creates the client instance, which is used later
  // to send commands to AWS Transcribe Streaming.
  //
  // The credentials are hard-coded because this is sample code, but that is not recommended for security reasons.
  // Consider another way of authenticating, such as combining this with Cognito authentication.
  private client = new TranscribeStreamingClient({
    region: 'ap-northeast-1',
    credentials: {
      accessKeyId: 'hogehoge',
      secretAccessKey: 'hogehoge',
      // sessionToken: 'hogehoge', // set this if needed
    },
  });

  constructor(private elementRef: ElementRef) {}

  ngOnInit() {
    this.outputArea = this.elementRef.nativeElement.querySelector('.output-text');
  }

  /**
   * Method executed when the start button in the HTML template is clicked.
   * Entry point of the voice recognition process.
   */
  async startVoiceRecognition(event: any) {
    // micStream is stopped later by calling `stop()` in `stopVoiceRecognition`.
    // That call actually runs AudioContext.close(), after which the micStream instance cannot be reused.
    // So once recognition has been stopped, a new micStream instance has to be created.
    //
    // For AudioContext.close(), see:
    // https://developer.mozilla.org/ja/docs/Web/API/AudioContext/close
    if (!micStream) {
      micStream = new MicrophoneStream();
    }

    await this.setStream();

    // Parameters for the audio data streamed to AWS Transcribe Streaming.
    // The key part is `AudioStream: this.audioStream()`, which is where the audio data is produced.
    const params: StartStreamTranscriptionCommandInput = {
      // https://docs.aws.amazon.com/ja_jp/transcribe/latest/dg/API_streaming_StartStreamTranscription.html#API_streaming_StartStreamTranscription_RequestSyntax
      LanguageCode: LanguageCode.JA_JP,
      MediaSampleRateHertz: 44_100, // valid range: minimum 8,000, maximum 48,000
      MediaEncoding: MediaEncoding.PCM,
      AudioStream: this.audioStream(),
      // VocabularyName: 'custom_vocabulary' // set this to use a custom vocabulary
    };

    // Automatic speech-to-text
    const command = new StartStreamTranscriptionCommand(params);
    let response: StartStreamTranscriptionCommandOutput;
    try {
      // Run AWS Transcribe Streaming with the client instance and the audio parameters prepared above.
      // `handleResponse()` turns the response into transcribed text.
      response = await this.client.send(command);
      await this.handleResponse(response);
    } catch (error: any) {
      console.dir(error);
    }
  }
  }

  /**
   * Parses the response from AWS Transcribe Streaming and produces the transcription.
   *
   * The overall implementation is copied from the AWS Transcribe Streaming SDK sample code
   * -> https://github.com/aws/aws-sdk-js-v3/tree/d8475f8d972d28fbc15cd7e23abfe18f9eab0644/clients/client-transcribe-streaming#handling-text-stream
   *
   * For the structure of the response, see:
   * -> https://docs.aws.amazon.com/ja_jp/transcribe/latest/dg/API_streaming_StartStreamTranscription.html#API_streaming_StartStreamTranscription_ResponseSyntax
   * */
  async handleResponse(response: StartStreamTranscriptionCommandOutput) {
    for await (const event of response.TranscriptResultStream!) {
      if (event.TranscriptEvent) {
        const results = event.TranscriptEvent.Transcript!.Results;

        results!
          .filter((result) => !result.IsPartial) // skip results that are still partial
          .forEach((result) => {
            (result.Alternatives || []).forEach((alternative) => {
              const transcript = alternative.Items!.map((item) => item.Content).join(' ');

              // Write the transcribed text to the textarea in the HTML.
              // A textarea is updated through `value` rather than `innerHTML`.
              this.outputArea.value += transcript;
            });
          });
      }
    }
  }

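  /**
   * Method executed when the stop button in the HTML template is clicked.
   *
   * As mentioned in the description of startVoiceRecognition() above,
   * calling micStream.stop() actually runs AudioContext.close().
   * After that, the micStream instance used so far has served its purpose.
   * A new instance is needed to stream again, so to make this explicit
   * micStream is set to null here, and startVoiceRecognition() creates an instance after a null check.
   */
  stopVoiceRecognition(event: any) {
    // Nothing to stop if recognition was never started or has already been stopped
    if (!micStream) {
      return;
    }
    micStream.stop();
    micStream = null;
  }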

  /**
   * Method used by startVoiceRecognition() above.
   * getUserMedia() requests media access for audio only,
   * and the audio captured by the microphone is set on the stream.
   */
  private async setStream() {
    micStream.setStream(
      await window.navigator.mediaDevices.getUserMedia({
        video: false,
        audio: true,
      })
    );
  }

  /**
   * Also used by startVoiceRecognition() above.
   *
   * Encodes the audio data flowing through the stream as PCM.
   * The implementation is copied from the AWS Transcribe Streaming SDK sample code
   * -> https://github.com/aws/aws-sdk-js-v3/tree/d8475f8d972d28fbc15cd7e23abfe18f9eab0644/clients/client-transcribe-streaming#acquire-from-browsers
   */
  private audioStream = async function* () {
    for await (const chunk of micStream) {
      yield {
        AudioEvent: { AudioChunk: pcmEncodeChunk(chunk) /* pcm Encoding is optional depending on the source */ },
      };
    }
  };
}

/**
 * The actual PCM encoding logic.
 *
 * The implementation is copied from the AWS Transcribe Streaming SDK sample code
 * -> https://github.com/aws/aws-sdk-js-v3/tree/d8475f8d972d28fbc15cd7e23abfe18f9eab0644/clients/client-transcribe-streaming#pcm-encoding
 */
function pcmEncodeChunk(chunk: Buffer) {
  const input = MicrophoneStream.toRaw(chunk);
  let offset = 0;
  const buffer = new ArrayBuffer(input.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++, offset += 2) {
    const s = Math.max(-1, Math.min(1, input[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return Buffer.from(buffer);
}
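
To see what the PCM encoding step does on its own, the following is a minimal sketch that applies the same clamping and scaling to a plain Float32Array, without microphone-stream or Buffer. The function name floatTo16BitPcm and the sample values are made up for illustration and are not part of the SDK or of the component above.

```typescript
// Hypothetical helper: converts float samples in the range [-1, 1]
// into 16-bit signed little-endian PCM, the layout used by pcmEncodeChunk above.
function floatTo16BitPcm(input: Float32Array): Uint8Array {
  const buffer = new ArrayBuffer(input.length * 2); // 2 bytes per sample
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    // Clamp out-of-range samples so they cannot overflow the 16-bit range
    const s = Math.max(-1, Math.min(1, input[i]));
    // Negative values scale to -32768, positive values to 32767
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}

// Illustrative input: silence, full positive, full negative, and an out-of-range sample
const pcm = floatTo16BitPcm(new Float32Array([0, 1, -1, 2]));
const pcmView = new DataView(pcm.buffer);
const decoded: number[] = [];
for (let i = 0; i < pcm.byteLength; i += 2) {
  decoded.push(pcmView.getInt16(i, true));
}
console.log(decoded);
```

Each input sample becomes two bytes, so the encoded chunk is always twice as long as the number of samples, and the out-of-range value 2 is clamped to the same result as 1.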

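Processing flow

Speech recognition and transcription proceed through the following functions.

1. setStream
2. pcmEncodeChunk
3. audioStream
4. handleResponse

The concrete steps are as follows.

1. setStream sets things up so that audio from the microphone keeps flowing into the stream
2. From then on, audio flows into the stream continuously
3. Each chunk that arrives on the stream is converted by pcmEncodeChunk
4. audioStream passes it to AWS Transcribe Streaming
5. handleResponse writes the returned information to the HTML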
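
The flow above can be sketched without a microphone or an AWS connection. In the sketch below, fakeMicStream and encodeChunk are stand-ins invented for illustration (they replace micStream and pcmEncodeChunk), and only the shape of the yielded message, `{ AudioEvent: { AudioChunk } }`, is taken from the component. It is meant to show how the async generator pulls chunks lazily, not to replace the real implementation.

```typescript
// Shape of one message passed as AudioStream (mirrors the component's audioStream())
type AudioEventMessage = { AudioEvent: { AudioChunk: Uint8Array } };

// Stand-in for micStream: yields two fake raw chunks instead of live audio
async function* fakeMicStream(): AsyncGenerator<Float32Array> {
  yield new Float32Array([0, 0.5]);
  yield new Float32Array([-0.5, 1, 0.25]);
}

// Stand-in for pcmEncodeChunk: 16-bit little-endian PCM
function encodeChunk(chunk: Float32Array): Uint8Array {
  const view = new DataView(new ArrayBuffer(chunk.length * 2));
  for (let i = 0; i < chunk.length; i++) {
    const s = Math.max(-1, Math.min(1, chunk[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(view.buffer);
}

// Wraps every encoded chunk in the message shape expected for AudioStream
async function* toAudioEvents(
  source: AsyncIterable<Float32Array>
): AsyncGenerator<AudioEventMessage> {
  for await (const chunk of source) {
    yield { AudioEvent: { AudioChunk: encodeChunk(chunk) } };
  }
}

// Consumes the stream the way a client would, recording the size of each message
async function collectChunkSizes(): Promise<number[]> {
  const sizes: number[] = [];
  for await (const message of toAudioEvents(fakeMicStream())) {
    sizes.push(message.AudioEvent.AudioChunk.byteLength);
  }
  return sizes;
}

collectChunkSizes().then((sizes) => console.log(sizes));
```

Because the generator only produces a message when the consumer asks for the next one, audio is encoded at the pace at which it is sent, which is why the component can hand `this.audioStream()` to the command before any audio exists.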

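Additional notes on the code

To keep the processing flow easy to follow, this sample does everything in a single component, all the way through to the output to HTML. Some of these steps are better delegated to services. If you use this code as a reference, evaluate that as needed.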
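
As one possible way to split things up, the logic that picks the final text out of a response could live in a pure function that a service exposes, leaving the component to handle only the DOM. The sketch below is an assumption about how that could look: extractFinalTranscripts is a hypothetical name, and the interfaces are minimal local types that cover only the fields used in handleResponse, not the full SDK types.

```typescript
// Minimal local types covering only the fields used here (the SDK types have more)
interface TranscriptItem {
  Content?: string;
}
interface TranscriptAlternative {
  Items?: TranscriptItem[];
}
interface TranscriptResult {
  IsPartial?: boolean;
  Alternatives?: TranscriptAlternative[];
}

// Hypothetical helper: returns one string per alternative of every final result
function extractFinalTranscripts(results: TranscriptResult[] = []): string[] {
  const transcripts: string[] = [];
  for (const result of results) {
    if (result.IsPartial) {
      continue; // partial results may still be revised, so skip them
    }
    for (const alternative of result.Alternatives ?? []) {
      const items = alternative.Items ?? [];
      transcripts.push(items.map((item) => item.Content ?? '').join(' '));
    }
  }
  return transcripts;
}

// Illustrative data: one partial result and one final result
const sampleResults: TranscriptResult[] = [
  { IsPartial: true, Alternatives: [{ Items: [{ Content: 'hel' }] }] },
  {
    IsPartial: false,
    Alternatives: [{ Items: [{ Content: 'hello' }, { Content: 'world' }] }],
  },
];
const finalTranscripts = extractFinalTranscripts(sampleResults);
console.log(finalTranscripts); // only the final result remains: 'hello world'
```

A function like this has no dependency on Angular or the DOM, so it can be unit-tested with plain objects, and the component only needs to append whatever it returns to the textarea.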

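Consider a different authentication mechanism for real applications

Hard-coding the credentials passed to the SDK, as in the sample code above, is not secure. It is better to authenticate through the SDK in another way, for example by combining it with Cognito authentication.

The following documentation and tutorial cover Cognito authentication.

Prepare the browser script.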

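IAM role permissions

If authentication through the SDK succeeds but the required permission has not been granted, an error (AccessDeniedException) occurs.

AccessDeniedException: {
  "Message":"User: ${user information} is not authorized to perform: transcribe:StartStreamTranscriptionWebSocket because no identity-based policy allows the transcribe:StartStreamTranscriptionWebSocket action"
}

Permissions therefore have to be granted, but the only Transcribe policies provided by AWS are the following two (as of 2022-06-08).

Policy name                      Description
AmazonTranscribeFullAccess       Full access
AmazonTranscribeReadOnlyAccess   Read-only

AmazonTranscribeFullAccess grants more than is needed, and AmazonTranscribeReadOnlyAccess does not include the permission for streaming transcription. So here I created a dedicated policy (the steps for creating it are omitted).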

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "transcribe:StartStreamTranscriptionWebSocket",
            "Resource": "*"
        }
    ]
}

Attaching this policy to the IAM role used by the SDK grants permission to run Transcribe Streaming.
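
The sample component only passes caught errors to console.dir, so a missing permission is easy to overlook. The following is a rough sketch of how the permission error could be told apart from other failures. It assumes that the thrown error carries the exception name in its `name` property, as the "AccessDeniedException: ..." output above suggests; explainTranscribeError is a hypothetical helper, not an SDK function.

```typescript
// Hypothetical helper: turns a caught error into a short, readable hint.
// Assumption: the error object exposes the exception name as `name`.
function explainTranscribeError(error: unknown): string {
  let name = 'UnknownError';
  if (typeof error === 'object' && error !== null && 'name' in error) {
    name = String((error as { name: unknown }).name);
  }

  if (name === 'AccessDeniedException') {
    return (
      'Permission missing: attach a policy that allows ' +
      'transcribe:StartStreamTranscriptionWebSocket to the IAM role in use.'
    );
  }
  return `Transcription failed (${name}).`;
}

// Illustrative error objects standing in for what the catch block receives
const denied = explainTranscribeError({ name: 'AccessDeniedException', message: 'not authorized' });
const other = explainTranscribeError(new Error('network down'));
console.log(denied);
console.log(other);
```

In the component, a helper like this could be called inside the catch block of startVoiceRecognition() and its result shown to the user instead of being written only to the console.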

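Custom vocabulary

AWS Transcribe Streaming is a very convenient real-time transcription service, but it does not always recognize everything accurately.
To improve recognition accuracy, you can register terms in a custom vocabulary in advance.

Below is the part of the code above that specifies the custom vocabulary.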

    // Parameters for the audio data streamed to AWS Transcribe Streaming.
    // The key part is `AudioStream: this.audioStream()`, which is where the audio data is produced.
    const params: StartStreamTranscriptionCommandInput = {
      // https://docs.aws.amazon.com/ja_jp/transcribe/latest/dg/API_streaming_StartStreamTranscription.html#API_streaming_StartStreamTranscription_RequestSyntax
      LanguageCode: LanguageCode.JA_JP,
      MediaSampleRateHertz: 44_100, // valid range: minimum 8,000, maximum 48,000
      MediaEncoding: MediaEncoding.PCM,
      AudioStream: this.audioStream(),

      // ★★★ Specify this parameter ★★★
      VocabularyName: 'custom_vocabulary' // set this to use a custom vocabulary
    };

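Custom vocabularies can be written in either a table format or a list format, and from what I have read, AWS recommends the table format.

(Excerpt)
Vocabulary tables versus lists
Using vocabulary tables is strongly recommended.

The steps for registering a custom vocabulary are not covered here; see the following.

Creating a custom vocabulary

Note that, as mentioned in the article above, registering a table-format file from the console results in an error, so be sure to register table-format files via S3.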

Source code

The code in this article is the code I used to verify the behavior described here.

References

StartStreamTranscription

Request syntax
Response syntax

Custom vocabularies
Creating a custom vocabulary using a table
Creating a custom vocabulary using a list