Whisper: Audio To Text

In our previous study session, the final class also covered the Whisper API.

Perhaps because it has little to do with the other OpenAI APIs, or perhaps because it went live only recently (March 1 this year).

Comparing the two classes, the content is largely the same; the main differences are:

  • Previous session: automatically splitting mp3 files (because of the 25 MB upload limit); see the sketch after this list.

  • Previous session: using the lower-level whisper.detect_language() and whisper.decode().

  • This session: a recording script (unrelated to the OpenAI API).

  • This session: downloading the Whisper model from GitHub and running it entirely on the local machine.
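
As a reminder of that splitting step, below is a minimal sketch of one way to cut a long mp3 into pieces that stay under the 25 MB limit. It assumes the pydub package (plus ffmpeg) is available; the file name long_podcast.mp3 and the 10-minute chunk length are illustrative, not taken from the original session.

# Sketch: split a long mp3 into smaller chunks (assumed approach,
# not the previous session's actual code).
from pydub import AudioSegment  # requires pydub + ffmpeg

CHUNK_MINUTES = 10  # arbitrary length chosen to stay well under 25 MB

audio = AudioSegment.from_mp3("long_podcast.mp3")  # hypothetical input file
chunk_ms = CHUNK_MINUTES * 60 * 1000

for i, start in enumerate(range(0, len(audio), chunk_ms)):
    part = audio[start:start + chunk_ms]
    part.export(f"part_{i:03d}.mp3", format="mp3")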


Introduction to the Whisper API

This section is copied from the previous study session.

A genuinely committed open-source release:

Methods

The Whisper API provides two methods: transcriptions and translations.

  • transcriptions: recognizes the audio file and outputs the text. (See Supported Languages below.)

  • translations: recognizes the audio file and outputs the text translated into English.

Supported Languages

The figure shows a WER (Word Error Rate) breakdown by languages of the Fleurs dataset using the large-v2 model.

Source: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Pricing

Model Usage
Whisper $0.006 / minute (rounded to the nearest second)

Source: Pricing
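
As a quick sanity check on that rate, here is a back-of-the-envelope helper (my own arithmetic, not code from the pricing page):

# Rough cost estimate: $0.006 per minute of audio, with the duration
# rounded to the nearest second before billing.
def whisper_cost_usd(duration_seconds: float) -> float:
    billed_seconds = round(duration_seconds)
    return billed_seconds / 60 * 0.006

print(whisper_cost_usd(90))    # 90 seconds -> ~0.009 USD
print(whisper_cost_usd(3600))  # one hour   -> ~0.36 USD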

Other References


Using the Whisper API

Example 1

A recording script the instructor copied from somewhere; it has nothing to do with OpenAI.

# Record Some audio

import wave
import sys
import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1 if sys.platform == "darwin" else 2
RATE = 44100


def record_audio(seconds: int, output_path: str = "output.wav"):
    with wave.open(output_path, "wb") as wf:
        p = pyaudio.PyAudio()
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)

        stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True)

        print("Recording...")
        for index in range(0, RATE // CHUNK * seconds):
            if index % (RATE // CHUNK) == 0:
                print(f"{index // (RATE // CHUNK)} / {seconds}s")
            wf.writeframes(stream.read(CHUNK))
        print("Done")

        stream.close()
        p.terminate()
    print(f"File saved at {output_path}")
    return output_path

Installation: use --upgrade or -U to upgrade the package to the latest version. (The API examples below also require the openai package itself.)

pip install -U openai-whisper

A quick walk-through of the whole program flow, plus using the prompt parameter to correct spelling mistakes in the speech-to-text output.

import openai  # assumes the OPENAI_API_KEY environment variable is set

record_audio(10)

audio_file = open("output.wav", "rb")

response = openai.Audio.transcribe(
    model="whisper-1",
    file=audio_file
)

response["text"]

# Fixing the typo: rewind the file before sending it again, and give the
# model a prompt containing the correct spellings.
audio_file.seek(0)
response_with_prompt = openai.Audio.transcribe(
    model="whisper-1",
    file=audio_file,
    prompt="man talking about OpenAI and DALL-E"
)

response_with_prompt["text"]

Example 2: openai.Audio.transcribe

The recording script is the same as before, so it is not repeated here.

record_audio(5, "./audio/french.wav")

french_file = open("./audio/french.wav", "rb")

french_response = openai.Audio.transcribe(
    model="whisper-1",
    file=french_file
)

french_response
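
The transcriptions endpoint also accepts optional parameters such as language, prompt, response_format, and temperature. A minimal sketch of passing two of them through the same 0.x-style client (the parameter values here are illustrative):

# Hint the spoken language and ask for a more detailed response format.
french_file = open("./audio/french.wav", "rb")

french_verbose = openai.Audio.transcribe(
    model="whisper-1",
    file=french_file,
    language="fr",                  # ISO-639-1 code of the spoken language
    response_format="verbose_json"  # includes language, duration and segments
)

print(french_verbose["language"], french_verbose["duration"])
print(french_verbose["text"])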

Example 3: openai.Audio.translate

The recording script is the same as before, so it is not repeated here.

italian_news = open("./audio/italian_news.wav", "rb")

italian_response = openai.Audio.translate(
    model="whisper-1",
    file=italian_news
)

italian_response

Example 4: Downloading the model and running it locally

import whisper

# Downloads the checkpoint on first use, then loads it from the local cache.
model = whisper.load_model("base")

# Transcription runs entirely on the local machine; no API call is made.
res = model.transcribe("./audio/italian_news.wav")

res
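
model.transcribe() returns a plain dict: besides the full text it carries the detected language and a list of timestamped segments. A minimal way to inspect it:

# Inspect the transcription result: detected language, full text,
# and per-segment timestamps.
print(res["language"])
print(res["text"])
for seg in res["segments"]:
    print(f'{seg["start"]:6.1f}s - {seg["end"]:6.1f}s  {seg["text"]}')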

Available models and languages

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x
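
For English-only audio, the .en checkpoints from the table are loaded the same way; only the name passed to load_model changes. A small sketch (the audio file name is hypothetical):

# English-only checkpoint; per the Whisper README these tend to perform
# better for the smaller sizes (tiny.en, base.en) on English audio.
model_en = whisper.load_model("base.en")
print(model_en.transcribe("./audio/english_sample.wav")["text"])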

Official example code

import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
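
On a CPU-only machine this example typically warns that FP16 is not supported; decoding can be forced to FP32 by adjusting the options (a minimal tweak, with the rest of the script unchanged):

# On CPU, request FP32 decoding explicitly to avoid the FP16 warning.
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)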