
Audio Transcription using Whisper from OpenAI

Recently I found myself needing to transcribe an entire YouTube interview. The purpose of this post is to use AI to transcribe the audio to text and then translate it from Spanish to English.

The interview was hosted by people from Opground.com. I want to acknowledge Eduard Teixidó and Marcel Gonzalbo from Opground for the interview, and to thank Lambda AI for providing free credit to run inference on the AI models described in this post.

Introduction

Transcription is the process of converting speech or audio into written text. As an example, the Spanish Congress of Deputies employs stenographers: people who write down on paper everything that is said in the chamber so that it can later be archived and published officially. These stenographers do their job perfectly; they transcribe exactly what is said. Technology, however, can help us accelerate the transcription of audio that is already recorded so that we can work with the text.

One of the first audio-to-text systems was Audrey, developed by Bell Labs in the 1950s. The system was able to recognize phonemes, not words or sentences, and “the huge machine occupied a six-foot-high relay rack, consumed substantial power and had streams of cables”.

Audrey was a great breakthrough but clearly not feasible for practical applications. Luckily, the field of AI has made huge progress in the last two decades, and one of the models with great performance is Whisper from OpenAI. Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The large model has around 1.55B parameters, or 6.2 GB in 32-bit floating point. Find more details of the model in the paper Robust Speech Recognition via Large-Scale Weak Supervision and the GitHub repository openai/whisper. This model is complex… so I defer to the reader to use the references provided to understand the architecture and the background. We will run inference on the model in Spanish, and according to the README of the repository, the large-v3 model achieves around 4.7 Word Error Rate (WER), which is an impressive metric. In this post we will use the Whisper large-v3 model to transcribe the interview.
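
For a first orientation, this is roughly what calling the model from Python looks like with the openai-whisper package (a minimal sketch with a placeholder file name; later in the post I use the command-line tool that ships with the package instead):

# Minimal sketch of the openai-whisper Python API (the post itself uses the CLI)
import whisper

model = whisper.load_model("large-v3")                 # weights are downloaded on first use
result = model.transcribe("audio.wav", language="es")  # audio.wav is a placeholder file name
print(result["text"])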

Instance in Lambda AI

As mentioned, I’m using Lambda AI as the cloud service to get a GPU. Yes, I tried running the AI model on my Mac and… surprise, I had to cancel it because it was taking too long. I’m using a gpu_1x_a100_sxm4 machine. Once I’m in the machine I run gpu_info (a CLI tool I built in another post) to get the characteristics of the GPU.

Detected 1 CUDA Capable Device(s)

Device 0: NVIDIA A100-SXM4-40GB
  PCI Domain/Bus/Device ID: 0/7/0
  Compute capability: 8.0
  Total global memory: 40442.4 MB
  Free memory (current): 40019.6 MB
  Total allocatable memory (current): 40442.4 MB
  Memory clock rate: 1215 MHz
  Memory bus width: 5120 bits
  L2 cache size: 40960 KB
  Max shared memory per block: 48 KB
  Total constant memory: 64 KB
  Warp size: 32
  Max threads per block: 1024
  Max threads per multiprocessor: 2048
  Multiprocessor count: 108
  Max grid dimensions: [2147483647, 65535, 65535]
  Max block dimensions: [1024, 1024, 64]
  Clock rate: 1410 MHz
  Concurrent kernels: Yes
  ECC enabled: Yes
  Integrated device: No
  Can map host memory: Yes
  Compute mode: Default
  Unified addressing: Yes
  Async engines: 3
  Device overlap: Yes
  PCI bus ID: 7
  PCI device ID: 0

This is an NVIDIA A100 (Ampere) GPU with 40 GB of memory, a great GPU for our purpose, inference. Running lscpu gives you the information about the CPU of the machine; I won’t go into detail here, but it is an x86_64 AMD EPYC 7J13 64-Core Processor (check the specs here). Pretty nice machine indeed!

Downloading audio from YouTube

First we need to download the audio, which you can do directly from YouTube. Normally, the app we will be using for the download relies on the cookies from your browser. That makes things hard on remote machines, which are usually command-line only and don’t have a browser. It is more convenient to download the audio on your local machine and then copy it to the remote machine.

Use the following script, and name it download_audio.py:

import argparse
from pathlib import Path
from yt_dlp import YoutubeDL

def download_audio(url: str, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    ydl_opts = {
        "outtmpl": str(out_dir / "%(title)s.%(ext)s"),
        "format": "bestaudio/best",
        "noplaylist": True,
        "quiet": True,
        "no_warnings": True,
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav", "preferredquality": "192"}
        ],
    }
    with YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
    # Try to resolve final .wav
    expected = out_dir / f"{info.get('title','audio')}.wav"
    if expected.exists():
        return expected
    # Fallback: newest wav in folder
    wavs = list(out_dir.glob("*.wav"))
    if not wavs:
        raise RuntimeError("No WAV file produced. Check ffmpeg/yt-dlp output.")
    return max(wavs, key=lambda p: p.stat().st_mtime)

def main():
    ap = argparse.ArgumentParser(description="Download YouTube audio as WAV.")
    ap.add_argument("url", help="YouTube URL")
    ap.add_argument("-o", "--outdir", default="outputs/_tmp", help="Output dir (default: outputs/_tmp)")
    args = ap.parse_args()

    out_dir = Path(args.outdir).resolve()
    wav = download_audio(args.url, out_dir)
    print(f"Saved WAV to: {wav}")

if __name__ == "__main__":
    main()

Now create a virtual environment and install yt-dlp; I’m using Python 3.12. Note that the script extracts the audio to WAV with an FFmpeg postprocessor, so ffmpeg also needs to be available on your PATH.

rm -rf .venv
python -m venv .venv
.venv/bin/python -m pip install -U yt-dlp

Run the command with your video URL:

PYTHONWARNINGS=ignore .venv/bin/python download_audio.py "https://www.youtube.com/watch?v=VIDEO_ID"

Now find your *.wav file in outputs/_tmp/, relative to where you ran the script. It will have the same name as the original video; you can rename it to audio.wav to keep things simple.

Copy audio to remote machine

Now that we have the audio downloaded, we need to copy the file to the remote machine with something like:

scp -i $HOME/.ssh/id_lambda ~/transcription/outputs/_tmp/audio.wav ubuntu@PUBLIC_IP:/home/ubuntu/audio.wav

Replace PUBLIC_IP with the public IP provided by the cloud service; it’s straightforward to get it from the Lambda instances webpage. The -i argument is followed by the private key generated to SSH into the remote machine.

Run Inference on a machine with GPU

Finally, the hard part: using the Whisper model to run inference. For that, on the remote machine we create a new Python environment with:

rm -rf .venv
python -m venv .venv
.venv/bin/pip install -U openai-whisper

The default system Python version I got was 3.10.12, which is relatively recent. Then activate the environment and check that the whisper executable is installed:

source .venv/bin/activate
which whisper
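
Before launching the transcription, it doesn’t hurt to quickly verify that PyTorch, which openai-whisper pulls in as a dependency, actually sees the GPU:

# Quick sanity check that CUDA is visible to PyTorch before running Whisper
import torch

print(torch.cuda.is_available())        # should print True on this instance
print(torch.cuda.get_device_name(0))    # e.g. NVIDIA A100-SXM4-40GB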

Finally, run inference on the A100 GPU with the command:

whisper audio.wav \
  --model large-v3 \
  --language es \
  --task transcribe \
  --device cuda \
  --fp16 True \
  --temperature 0 \
  --beam_size 1 \
  --output_format txt \
  --output_dir large-v3

which will create a directory large-v3 containing audio.txt. For the interview, I get the first 10 lines with

cat large-v3/audio.txt | head -10

as

a las historias de las personas que hacen realidad esta evolución tecnológica, los techies.
Y nada de esto sería posible sin el soporte de Upground, el primer reclutador virtual.
Un sistema basado en inteligencia artificial que replica entrevistas virtuales
y con solo una única entrevista con su chatbot, busca, aplica y gestiona
todas las oportunidades del sector tech por ti.
¿Hay algo por lo que aceptarías un nuevo reto profesional?
No sacrifiques tu tiempo libre, que Upground es tu aliado.
Y con esto empezamos el día de hoy. Hola Marcel.
Hola, ¿qué tal Eduard? Buenos días, buen día. ¿Cómo estamos?
Muy bien, aquí estamos. Hoy por la mañana que tenemos un invitado muy interesante

Translate to English

Use Whisper to translate to English; all parameters are the same except the task, which is translate this time.

whisper audio.wav \
  --model large-v3 \
  --language es \
  --task translate \
  --device cuda \
  --fp16 True \
  --temperature 0 \
  --beam_size 1 \
  --output_format txt \
  --output_dir large-v3_en

with the first 10 lines

to the stories of the people who make this technological evolution a reality, the techies.
And none of this would be possible without the support of UpGround, the first virtual recruiter.
A system based on artificial intelligence that replicates virtual interviews
and with only one interview with its chatbot,
searches, applies and manages all opportunities in the tech sector for you.
Is there something you would accept as a new professional challenge?
Don't waste your free time, because UpGround is your ally.
And with this we begin today. Hello Marcel.
Hello, how are you Eduard? Good morning, how are you?
Very well, here we are. Today in the morning we have a very interesting guest

which seems pretty close to the Spanish version. We made it!
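
As a side note, running the CLI twice loads the large-v3 weights twice. If you prefer, both tasks can be run from a single Python script so the model is only loaded once; a minimal sketch (the output file names are just placeholders):

import whisper

# Load the large-v3 model once and reuse it for both tasks
model = whisper.load_model("large-v3", device="cuda")

for task, out_name in [("transcribe", "audio_es.txt"), ("translate", "audio_en.txt")]:
    result = model.transcribe("audio.wav", language="es", task=task,
                              temperature=0, beam_size=1, fp16=True)
    with open(out_name, "w", encoding="utf-8") as f:
        f.write(result["text"])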

Conclusions and future analysis

This has been a quick job, a quick translation. It ran just fine, although, as a matter of fact, I did have to go through the English version and fix parts of the text by hand. The model predicted most words correctly, but sometimes the context was hard to follow. I don’t think the model is wrong, obviously; I just didn’t have the time to investigate further. Perhaps my audio quality wasn’t good? Maybe I needed to adjust other parameters like temperature when running the inference? Anyway, it was a fun exercise that has some practical value for me too. If you reached this part, thank you for reading the post!
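
On the temperature point, one thing worth trying (I haven’t verified that it improves this particular interview) is passing a tuple of temperatures so that decoding falls back to a higher temperature whenever the built-in quality checks fail on a segment:

import whisper

model = whisper.load_model("large-v3", device="cuda")
# With a tuple, decoding is retried at the next temperature whenever the
# compression-ratio or average log-probability checks fail on a segment.
result = model.transcribe("audio.wav", language="es", task="translate",
                          temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0))
print(result["text"][:300])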

This post is licensed under CC BY 4.0 by the author.