Automating Speech-to-Text: How to Transcribe Audio & Video with Azure Speech Services

04 Jan 2025

Introduction

Speech-to-text technology lets businesses and content creators transcribe audio and video files without typing them out by hand. Whether you're a developer, researcher, or media professional, automating transcription saves time.

This guide walks through building an automated transcription tool with the Speech-to-Text API from Azure Cognitive Services on Ubuntu Linux. By the end of this article, you’ll be able to:

Convert video files to audio for transcription
Normalize audio formats for better accuracy
Leverage Azure Speech-to-Text API for precise transcriptions
Automate the transcription process using Python on Ubuntu
Optionally, run this workflow on an Azure Virtual Machine (VM)

Why Automate Speech-to-Text Transcription?

Manual transcription is time-consuming and prone to errors. Automating the process converts multimedia content to text far faster than typing it out. Azure Speech Services provides AI-powered transcription, which makes it a common choice for businesses, podcasters, and other professionals.

Prerequisites

Before setting up the transcription tool, ensure you have:

A Microsoft Azure account with Speech Services enabled
Python 3 installed on Ubuntu
FFmpeg for media file conversion
Required Python libraries: azure-cognitiveservices-speech and moviepy (argparse is part of the Python standard library and needs no separate install)

Run the following commands to install dependencies:

sudo apt update && sudo apt install ffmpeg -y
pip install azure-cognitiveservices-speech moviepy

Step 1: Setting Up Azure Speech Services

Create an Azure Account: Sign up at the Azure Portal if you don’t have an account.
Set Up Speech Services: Navigate to Azure Speech Services, create a resource, select a pricing tier, and copy the API Key and Region from the Keys and Endpoint tab.
Configure the Speech SDK in Python:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_AZURE_SPEECH_KEY",
    region="YOUR_AZURE_REGION"
)
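Hardcoding the key is fine for a quick test, but a key committed to a repository is easy to leak. One option is to read both values from environment variables. The sketch below assumes they are named SPEECH_KEY and SPEECH_REGION (names chosen for this example, not required by the SDK); load_speech_settings and build_speech_config are hypothetical helpers, not part of the original script.

```python
import os


def load_speech_settings(env=os.environ):
    """Return (key, region) from the environment, or raise if either is missing.

    SPEECH_KEY and SPEECH_REGION are variable names assumed for this sketch.
    """
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION before running the script.")
    return key, region


def build_speech_config(env=os.environ):
    # Imported here so load_speech_settings() works even without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    key, region = load_speech_settings(env)
    return speechsdk.SpeechConfig(subscription=key, region=region)
```

With this in place, the key and region can be exported in the shell before the script runs, and the script itself contains no secrets.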

Step 2: Writing the Python Script

Handling Command-Line Arguments

import argparse

parser = argparse.ArgumentParser(description="Transcribe speech from video and audio files.")
parser.add_argument("media_files", nargs="+", help="Paths to video/audio files")
args = parser.parse_args()
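Because nargs="+" collects every positional argument into a list, the script can take any number of files in one run. A small sketch of how the parsed result looks, wrapping the parser in a hypothetical build_parser function and passing an explicit argument list instead of reading the real command line:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Transcribe speech from video and audio files."
    )
    parser.add_argument("media_files", nargs="+", help="Paths to video/audio files")
    return parser


# Passing a list instead of relying on sys.argv makes the parser easy to try out.
args = build_parser().parse_args(["video1.mp4", "audio1.wav"])
print(args.media_files)
```

Running the parser with no file arguments makes argparse print a usage message and exit, so the rest of the script can assume at least one path is present.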

Extract Audio from Video Files

import subprocess

def extract_audio(video_file):
    # Output: 16 kHz, mono, 16-bit PCM WAV next to the source video
    audio_file = f"{video_file.rsplit('.', 1)[0]}_audio.wav"
    subprocess.run([
        "ffmpeg", "-i", video_file, "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_file, "-y"
    ], check=True)
    return audio_file

Convert Audio to the Required Format

def convert_audio_to_wav(input_audio):
    # Re-encode any audio file to 16 kHz, mono, 16-bit PCM WAV
    output_wav = input_audio.rsplit('.', 1)[0] + "_fixed.wav"
    subprocess.run([
        "ffmpeg", "-i", input_audio, "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", output_wav, "-y"
    ], check=True)
    return output_wav

Transcribe Audio Using Azure Speech-to-Text

def transcribe_audio(audio_file, speech_config):
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    # recognize_once() returns after the first recognized utterance
    result = speech_recognizer.recognize_once()
    return result.text if result.reason == speechsdk.ResultReason.RecognizedSpeech else None
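One caveat: recognize_once() is designed for single-shot recognition and returns after the first utterance, so a long recording would come back with only its opening sentence. For longer files, the SDK's continuous recognition mode is the usual alternative. The sketch below is one way to use it, not the original script's method; transcribe_continuous is a hypothetical helper that takes a SpeechRecognizer built the same way as in transcribe_audio.

```python
import threading


def transcribe_continuous(speech_recognizer):
    """Collect every recognized utterance from a file-backed recognizer.

    `speech_recognizer` is expected to be a speechsdk.SpeechRecognizer created
    with an AudioConfig(filename=...); this helper name is an assumption.
    """
    segments = []
    done = threading.Event()

    def on_recognized(evt):
        # Fired once per finalized utterance; skip empty (no-match) results.
        if evt.result.text:
            segments.append(evt.result.text)

    def on_stop(evt):
        # Fired when the file ends or recognition is canceled.
        done.set()

    speech_recognizer.recognized.connect(on_recognized)
    speech_recognizer.session_stopped.connect(on_stop)
    speech_recognizer.canceled.connect(on_stop)

    speech_recognizer.start_continuous_recognition()
    done.wait()  # block until the whole file has been processed
    speech_recognizer.stop_continuous_recognition()
    return " ".join(segments)
```

The helper takes the recognizer as an argument rather than building it, which keeps the event wiring separate from the Azure configuration and makes it possible to exercise with a stand-in object.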

Save the Transcription

import os

def save_transcription(text, filename):
    os.makedirs("transcriptions", exist_ok=True)
    # Write as UTF-8 so non-ASCII characters in the transcript are preserved
    with open(f"transcriptions/{filename}_transcription.txt", "w", encoding="utf-8") as file:
        file.write(text)
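Two details are worth handling around this step. transcribe_audio returns None when no speech is recognized, and passing None to file.write() raises a TypeError; and the filename argument needs to be derived from the input path somehow. A minimal sketch of both, using hypothetical helpers transcription_path and save_if_recognized and assuming the transcript should be named after the media file's base name:

```python
from pathlib import Path


def transcription_path(media_file, out_dir="transcriptions"):
    """Map a media path to the path of its transcript file inside out_dir."""
    return Path(out_dir) / f"{Path(media_file).stem}_transcription.txt"


def save_if_recognized(text, media_file, out_dir="transcriptions"):
    """Write the transcript and return its path, or return None when there is no text."""
    if not text:
        return None
    target = transcription_path(media_file, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

Returning the path (or None) lets the caller report which files produced a transcript and which were skipped.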

Step 3: Running the Script

Assuming the script is saved as transcribe.py, run it with one or more audio or video files:

python3 transcribe.py video1.mp4 audio1.wav

This script will:

Extract audio from video (if applicable)
Convert the audio to the required format
Send it to Azure’s Speech-to-Text API
Save the transcribed text in the transcriptions/ folder
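The four steps above can be tied together with a small dispatch function. This is a sketch under stated assumptions rather than the original script's main loop: the extension set, is_video, and process_file are illustrative names, video files go through extraction and other files through conversion, and the step functions are passed in as arguments so the flow can be tried without FFmpeg or Azure.

```python
from pathlib import Path

# Extensions treated as video in this sketch; extend the set as needed.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi", ".webm"}


def is_video(media_file):
    return Path(media_file).suffix.lower() in VIDEO_EXTENSIONS


def process_file(media_file, extract_audio, convert_audio_to_wav, transcribe, save):
    """Run one file through the pipeline; returns True if a transcript was saved."""
    if is_video(media_file):
        wav_file = extract_audio(media_file)
    else:
        wav_file = convert_audio_to_wav(media_file)
    text = transcribe(wav_file)
    if text is None:
        # Nothing recognized: skip saving instead of writing an empty file.
        return False
    save(text, Path(media_file).stem)
    return True
```

In the real script, the transcribe argument could be a small wrapper such as lambda wav: transcribe_audio(wav, speech_config), with process_file called once for each entry in args.media_files.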

Advanced Features & Future Enhancements

This workflow can be expanded to support:

Live speech transcription for real-time applications
Multi-speaker recognition for differentiating voices
Automatic translation for multilingual content

Conclusion

By building on Azure Cognitive Services, this automated transcription tool offers an efficient and scalable way to process audio and video files. Whether you're handling podcasts, interviews, or business meetings, this approach saves time compared with manual transcription.

For complete source code, visit: GitHub Repository

Dheeraj Kumar

Technical Project Manager

Tech Lead with 8+ years of experience in software development, project management, and UI/UX design, specialising in building scalable mobile applications, leading cross-functional teams, and delivering user-centric solutions with a strong focus on performance, quality, and innovation.