Whisper: Audio To Text

In our previous study session, the final class also covered the Whisper API.

Perhaps because it has little to do with the other OpenAI APIs, or perhaps because it went live only recently (March 1 this year).

Comparing the two classes, the content is largely the same; the main differences are:

  • Previous session: automatically splitting mp3 files (because of the 25 MB upload limit); see the sketch after this list.

  • Previous session: using the lower-level whisper.detect_language() and whisper.decode().

  • This session: a recording script (unrelated to the OpenAI API).

  • This session: downloading the Whisper model from GitHub and running it entirely on the local machine.
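
As a reminder of that splitting step, below is a minimal sketch of one way to cut a long mp3 into pieces that stay under the 25 MB limit. It assumes the pydub package (plus ffmpeg) is available; the file name long_podcast.mp3 and the 10-minute chunk length are illustrative, not taken from the original session.

# Sketch: split a long mp3 into smaller chunks (assumed approach,
# not the previous session's actual code).
from pydub import AudioSegment  # requires pydub + ffmpeg

CHUNK_MINUTES = 10  # arbitrary length chosen to stay well under 25 MB

audio = AudioSegment.from_mp3("long_podcast.mp3")  # hypothetical input file
chunk_ms = CHUNK_MINUTES * 60 * 1000

for i, start in enumerate(range(0, len(audio), chunk_ms)):
    part = audio[start:start + chunk_ms]
    part.export(f"part_{i:03d}.mp3", format="mp3")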


Introduction to the Whisper API

This section is copied from the previous study session.

A genuinely committed open-source release:

Methods

The Whisper API provides two methods: transcriptions and translations.

  • transcriptions: recognizes the audio file and outputs the text. (See Supported Languages below.)

  • translations: recognizes the audio file and outputs the text translated into English.

Supported Languages

The figure shows a WER (Word Error Rate) breakdown by languages of the Fleurs dataset using the large-v2 model.

Source: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

Pricing

Model Usage
Whisper $0.006 / minute (rounded to the nearest second)

Source: Pricing
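
As a quick sanity check on that rate, here is a back-of-the-envelope helper (my own arithmetic, not code from the pricing page):

# Rough cost estimate: $0.006 per minute of audio, with the duration
# rounded to the nearest second before billing.
def whisper_cost_usd(duration_seconds: float) -> float:
    billed_seconds = round(duration_seconds)
    return billed_seconds / 60 * 0.006

print(whisper_cost_usd(90))    # 90 seconds -> ~0.009 USD
print(whisper_cost_usd(3600))  # one hour   -> ~0.36 USD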

Other References


Using the Whisper API

Example 1

A recording script the instructor copied from somewhere; it has nothing to do with OpenAI.

# Record Some audio

import wave
import sys
import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1 if sys.platform == "darwin" else 2
RATE = 44100


def record_audio(seconds: int, output_path: str = "output.wav"):
    with wave.open(output_path, "wb") as wf:
        p = pyaudio.PyAudio()
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)

        stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True)

        print("Recording...")
        for index in range(0, RATE // CHUNK * seconds):
            if index % (RATE // CHUNK) == 0:
                print(f"{index // (RATE // CHUNK)} / {seconds}s")
            wf.writeframes(stream.read(CHUNK))
        print("Done")

        stream.close()
        p.terminate()
    print(f"File saved at {output_path}")
    return output_path

Installation: use --upgrade or -U to upgrade the package to the latest version. (The API examples below also require the openai package itself.)

pip install -U openai-whisper

A quick walk-through of the whole program flow, plus using the prompt parameter to correct spelling mistakes in the speech-to-text output.

import openai  # assumes the OPENAI_API_KEY environment variable is set

record_audio(10)

audio_file = open("output.wav", "rb")

response = openai.Audio.transcribe(
    model="whisper-1",
    file=audio_file
)

response["text"]

# Fixing the typo: rewind the file before sending it again, and give the
# model a prompt containing the correct spellings.
audio_file.seek(0)
response_with_prompt = openai.Audio.transcribe(
    model="whisper-1",
    file=audio_file,
    prompt="man talking about OpenAI and DALL-E"
)

response_with_prompt["text"]

Example 2: openai.Audio.transcribe

The recording script is the same as before, so it is not repeated here.

record_audio(5, "./audio/french.wav")

french_file = open("./audio/french.wav", "rb")

french_response = openai.Audio.transcribe(
    model="whisper-1",
    file=french_file
)

french_response
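
The transcriptions endpoint also accepts optional parameters such as language, prompt, response_format, and temperature. A minimal sketch of passing two of them through the same 0.x-style client (the parameter values here are illustrative):

# Hint the spoken language and ask for a more detailed response format.
french_file = open("./audio/french.wav", "rb")

french_verbose = openai.Audio.transcribe(
    model="whisper-1",
    file=french_file,
    language="fr",                  # ISO-639-1 code of the spoken language
    response_format="verbose_json"  # includes language, duration and segments
)

print(french_verbose["language"], french_verbose["duration"])
print(french_verbose["text"])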

Example 3: openai.Audio.translate

The recording script is the same as before, so it is not repeated here.

italian_news = open("./audio/italian_news.wav", "rb")

italian_response = openai.Audio.translate(
    model="whisper-1",
    file=italian_news
)

italian_response

Example 4: Downloading the model and running it locally

import whisper

# Downloads the checkpoint on first use, then loads it from the local cache.
model = whisper.load_model("base")

# Transcription runs entirely on the local machine; no API call is made.
res = model.transcribe("./audio/italian_news.wav")

res
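
model.transcribe() returns a plain dict: besides the full text it carries the detected language and a list of timestamped segments. A minimal way to inspect it:

# Inspect the transcription result: detected language, full text,
# and per-segment timestamps.
print(res["language"])
print(res["text"])
for seg in res["segments"]:
    print(f'{seg["start"]:6.1f}s - {seg["end"]:6.1f}s  {seg["text"]}')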

Available models and languages

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x
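
For English-only audio, the .en checkpoints from the table are loaded the same way; only the name passed to load_model changes. A small sketch (the audio file name is hypothetical):

# English-only checkpoint; per the Whisper README these tend to perform
# better for the smaller sizes (tiny.en, base.en) on English audio.
model_en = whisper.load_model("base.en")
print(model_en.transcribe("./audio/english_sample.wav")["text"])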

Official example code

import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
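
On a CPU-only machine this example typically warns that FP16 is not supported; decoding can be forced to FP32 by adjusting the options (a minimal tweak, with the rest of the script unchanged):

# On CPU, request FP32 decoding explicitly to avoid the FP16 warning.
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)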