How to Use Whisper to Extract Video Text for Free

  • 466 Words
  • 2 Minutes
  • 14 Aug, 2024

When working with video files, there are times when you need to transcribe the audio portion into text. If the video does not have embedded subtitles, you can use OpenAI’s Whisper model to achieve this. This article will detail how to use Python and the Whisper model to extract audio from a video and transcribe it into text. We will first cover how to transcribe using a CPU, followed by how to install GPU dependencies, detect a GPU, and use a GPU for acceleration.

1. Speech Recognition Using CPU

1.1 Installing Dependencies

First, make sure Python and the ffmpeg command-line tool are installed on your system. Then install Whisper and ffmpeg-python:

Terminal window

pip install openai-whisper
pip install ffmpeg-python
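
Whisper shells out to the ffmpeg executable when it loads audio, so ffmpeg must be on your PATH in addition to the Python packages above. A quick sanity check (a minimal sketch, not part of the original workflow):

import shutil

# Whisper invokes the ffmpeg executable when loading audio files
print("ffmpeg found:", shutil.which("ffmpeg") is not None)

import whisper  # fails here if openai-whisper did not install correctly
print("Whisper import OK")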

1.2 Extracting Audio from Video

Use ffmpeg to extract the audio and save it in WAV format:

import ffmpeg

def extract_audio(video_path, output_audio_path):
    # ffmpeg infers the WAV output format from the .wav file extension
    ffmpeg.input(video_path).output(output_audio_path).run()

video_path = 'path/to/your/video.mp4'
audio_path = 'output.wav'
extract_audio(video_path, audio_path)
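
Whisper resamples everything to 16 kHz mono internally, so the plain WAV produced above works as-is. If you prefer to downmix and resample during extraction (which also keeps the intermediate file small), ffmpeg-python passes output options straight through to ffmpeg. A variant of the function above, with extract_audio_16k and output_16k.wav used purely as illustrative names:

import ffmpeg

def extract_audio_16k(video_path, output_audio_path):
    # ac=1 downmixes to mono, ar=16000 resamples to 16 kHz (Whisper's native rate);
    # overwrite_output() allows re-running without an "overwrite?" prompt
    (
        ffmpeg
        .input(video_path)
        .output(output_audio_path, ac=1, ar=16000)
        .overwrite_output()
        .run()
    )

extract_audio_16k('path/to/your/video.mp4', 'output_16k.wav')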

1.3 Transcribing Using CPU

Without a GPU, the Whisper model will process the transcription using the CPU. Here’s an example of how to use Whisper for speech recognition:

import whisper

def transcribe_audio(audio_path):
    # "base" is small and fast; larger checkpoints (small, medium, large) are more accurate
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result["text"]

transcription = transcribe_audio(audio_path)
print(transcription)
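
transcribe() also accepts decoding options and returns timestamped segments alongside the full text. A small sketch, assuming the recording is in English (drop the language argument to let Whisper auto-detect it):

import whisper

model = whisper.load_model("base")

# Specifying the language skips automatic language detection
result = model.transcribe("output.wav", language="en")

# The result contains timestamped segments in addition to result["text"]
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f}s -> {segment['end']:7.2f}s] {segment['text']}")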

2. Accelerating with GPU

2.1 Installing GPU Dependencies

If you want to use a GPU for acceleration, install a CUDA-enabled build of PyTorch. The index URL below targets CUDA 11.8 (cu118); pick the URL that matches the CUDA version your driver supports:

Terminal window

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

2.2 Detecting Available GPU

Before using a GPU, you need to check if there is an available GPU in your system. The following code can be used to detect a GPU:

import torch

print("CUDA Available: ", torch.cuda.is_available())
print("Number of GPUs: ", torch.cuda.device_count())

# Only query device details if a GPU is actually present;
# torch.cuda.current_device() raises an error on CPU-only systems
if torch.cuda.is_available():
    print("Current GPU: ", torch.cuda.current_device())
    print("GPU Name: ", torch.cuda.get_device_name(0))
else:
    print("No GPU available")

2.3 Transcribing Using GPU

If your system has an available GPU, you can load the Whisper model onto the GPU for acceleration. Ensure that you have installed the GPU version of PyTorch as per the earlier steps. Here is an example of how to use a GPU for speech recognition:

import whisper
import torch

# Check if a GPU is available and load the model onto it
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base").to(device)

def transcribe_audio(audio_path):
    result = model.transcribe(audio_path)
    return result["text"]

transcription = transcribe_audio(audio_path)
print(transcription)
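
One detail worth knowing: transcribe() decodes in fp16 by default, which produces a "FP16 is not supported on CPU; using FP32 instead" warning when the model runs on the CPU. Tying the flag to the chosen device keeps the same code path quiet on both; a minimal sketch:

import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)  # load_model also accepts the device directly

# Use half precision only on the GPU; fall back to fp32 on the CPU
result = model.transcribe("output.wav", fp16=(device == "cuda"))
print(result["text"])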

3. Complete Code Example

Combining all the steps, here is a complete code example that includes audio extraction and transcription using both CPU and GPU:

import ffmpeg
import whisper
import torch

def extract_audio(video_path, output_audio_path):
    ffmpeg.input(video_path).output(output_audio_path).run()

def transcribe_audio(audio_path, device):
    model = whisper.load_model("base").to(device)
    result = model.transcribe(audio_path)
    return result["text"]

# File path configuration
video_path = 'path/to/your/video.mp4'
audio_path = 'output.wav'

# Extract audio
extract_audio(video_path, audio_path)

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Perform speech recognition
transcription = transcribe_audio(audio_path, device)
print(transcription)
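
If you want the transcript saved as a text file rather than only printed, append a few lines to the script above (transcript.txt is just an example filename):

# Write the transcript to disk alongside the extracted audio
with open('transcript.txt', 'w', encoding='utf-8') as f:
    f.write(transcription)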

4. Conclusion

With the steps above, you can use the Whisper model to extract the audio from a video and turn it into a text transcript. If your system has a GPU, loading the model onto it can significantly speed up transcription.