Automating Speech-to-Text: How to Transcribe Audio & Video with Azure Speech Services

Introduction
In today's digital landscape, businesses and content creators rely on speech-to-text technology for efficient transcription of audio and video files. Whether you're a developer, researcher, or media professional, automating speech transcription saves time and enhances productivity.
This guide walks you through building an automated transcription tool with the Azure Cognitive Services Speech-to-Text API on Ubuntu Linux. By the end of this article, you’ll be able to:
- Convert video files to audio for transcription
- Normalize audio formats for better accuracy
- Leverage Azure Speech-to-Text API for precise transcriptions
- Automate the transcription process using Python on Ubuntu
- Optionally, run this workflow on an Azure Virtual Machine (VM)
Why Automate Speech-to-Text Transcription?
Manual transcription is time-consuming and prone to errors. Automating it shortens turnaround and makes it practical to process large batches of audio and video. Azure Speech Services provides AI-powered transcription, which makes it a common choice for businesses, podcasters, and other professionals.
Prerequisites
Before setting up the transcription tool, ensure you have:
- A Microsoft Azure account with Speech Services enabled
- Python 3 installed on Ubuntu
- FFmpeg for media file conversion
- Required Python libraries: azure-cognitiveservices-speech and moviepy (argparse ships with Python's standard library and needs no separate install)
Run the following commands to install dependencies:

    sudo apt update && sudo apt install ffmpeg -y
    pip install azure-cognitiveservices-speech moviepy

Step 1: Setting Up Azure Speech Services
- Create an Azure Account: Sign up at Azure Portal if you don’t have an account.
- Set Up Speech Services: Navigate to Azure Speech Services, create a resource, select a pricing tier, and copy the API Key and Region from the Keys and Endpoint tab.
- Configure the Speech SDK in Python:

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription="YOUR_AZURE_SPEECH_KEY",
        region="YOUR_AZURE_REGION",
    )

Step 2: Writing the Python Script
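Rather than hardcoding the key and region from Step 1 in the script, one option is to read them from environment variables. The sketch below assumes two variable names, AZURE_SPEECH_KEY and AZURE_SPEECH_REGION, and a hypothetical helper called load_speech_settings; neither comes from the Azure SDK.

```python
import os


def load_speech_settings(env=os.environ):
    """Return (key, region) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    key = env.get("AZURE_SPEECH_KEY")
    region = env.get("AZURE_SPEECH_REGION")

    # Report every missing variable at once instead of failing on the first.
    required = (("AZURE_SPEECH_KEY", key), ("AZURE_SPEECH_REGION", region))
    missing = [name for name, value in required if not value]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return key, region
```

The returned pair can then be passed to speechsdk.SpeechConfig(subscription=key, region=region) in place of the placeholder strings.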
Handling Command-Line Arguments

    import argparse

    parser = argparse.ArgumentParser(description="Transcribe speech from video and audio files.")
    parser.add_argument("media_files", nargs="+", help="Paths to video/audio files")
    args = parser.parse_args()

Extract Audio from Video Files

    import subprocess

    def extract_audio(video_file):
        # Write a 16 kHz, mono, 16-bit PCM WAV next to the source video.
        audio_file = f"{video_file.rsplit('.', 1)[0]}_audio.wav"
        subprocess.run([
            "ffmpeg", "-y",  # -y overwrites an existing output file
            "-i", video_file,
            "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
            audio_file,
        ], check=True)
        return audio_file

Convert Audio to the Required Format

    def convert_audio_to_wav(input_audio):
        # Re-encode any audio file to 16 kHz, mono, 16-bit PCM WAV.
        output_wav = input_audio.rsplit('.', 1)[0] + "_fixed.wav"
        subprocess.run([
            "ffmpeg", "-y",  # -y overwrites an existing output file
            "-i", input_audio,
            "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
            output_wav,
        ], check=True)
        return output_wav

Transcribe Audio Using Azure Speech-to-Text
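Before handing a file to the recognizer, it can be worth confirming that it really is in the 16 kHz, mono, 16-bit PCM layout that the FFmpeg commands in this guide produce. The sketch below uses only the standard library wave module; the function name is hypothetical and not part of the original script.

```python
import wave


def is_recognizer_ready(wav_path, rate=16000, channels=1, sample_width=2):
    """Return True if wav_path is a PCM WAV with the expected layout.

    sample_width is in bytes, so 2 means 16-bit samples.
    """
    try:
        with wave.open(wav_path, "rb") as wav:
            return (
                wav.getframerate() == rate
                and wav.getnchannels() == channels
                and wav.getsampwidth() == sample_width
            )
    except (wave.Error, EOFError, OSError):
        # Not a readable PCM WAV file (wrong container, truncated, or missing).
        return False
```

A file that fails this check can be passed through convert_audio_to_wav first. The transcription function itself follows.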

    def transcribe_audio(audio_file, speech_config):
        audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
        speech_recognizer = speechsdk.SpeechRecognizer(
            speech_config=speech_config, audio_config=audio_config
        )
        # recognize_once() returns after the first utterance, so only the
        # start of a long recording is transcribed.
        result = speech_recognizer.recognize_once()
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            return result.text
        return None

Save the Transcription

    import os

    def save_transcription(text, filename):
        # Create the output folder on first use and write the text as UTF-8.
        os.makedirs("transcriptions", exist_ok=True)
        with open(f"transcriptions/{filename}_transcription.txt", "w", encoding="utf-8") as file:
            file.write(text)

Step 3: Running the Script
To transcribe an audio or video file, run:

    python3 transcribe.py video1.mp4 audio1.wav

This script will:
- Extract audio from video (if applicable)
- Convert the audio to the required format
- Send it to Azure’s Speech-to-Text API
- Save the transcribed text in the transcriptions/ folder
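The steps above can be sketched as a single per-file function. This is a minimal sketch rather than the original script's driver: process_file, is_video, and the VIDEO_EXTENSIONS set are hypothetical names, and the pipeline functions from Step 2 are passed in as arguments so the flow can be tried without Azure credentials.

```python
import os

# Assumed list of video extensions; adjust to the formats you actually use.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi", ".webm"}


def is_video(path):
    """Classify a file as video purely by its extension."""
    return os.path.splitext(path)[1].lower() in VIDEO_EXTENSIONS


def process_file(path, extract_audio, convert_audio_to_wav, transcribe, save):
    """Run one media file through the pipeline.

    Returns True if a transcription was saved, False if nothing was recognized.
    """
    # Video goes through extraction; audio is only re-encoded.
    if is_video(path):
        wav_path = extract_audio(path)
    else:
        wav_path = convert_audio_to_wav(path)

    text = transcribe(wav_path)
    if not text:
        return False

    # Name the output after the input file, without folder or extension.
    base_name = os.path.splitext(os.path.basename(path))[0]
    save(text, base_name)
    return True
```

In transcribe.py, this might be called in a loop over args.media_files, passing extract_audio, convert_audio_to_wav, a small wrapper around transcribe_audio that supplies speech_config, and save_transcription.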
Advanced Features & Future Enhancements
This workflow can be expanded to support:
- Live speech transcription for real-time applications
- Multi-speaker recognition for differentiating voices
- Automatic translation for multilingual content
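Live transcription relies on the Speech SDK's continuous recognition mode, which also helps with long files, since recognize_once() stops after a single utterance. The sketch below assumes a SpeechRecognizer built as in Step 2 and collects every recognized phrase until the session ends. The name transcribe_continuously is hypothetical, and detailed handling of canceled sessions is left out.

```python
import threading


def transcribe_continuously(recognizer, timeout=None):
    """Collect all recognized phrases from a recognizer until its session ends.

    recognizer is expected to expose the Speech SDK's event signals
    (recognized, session_stopped, canceled) and the start/stop methods
    for continuous recognition.
    """
    phrases = []
    done = threading.Event()

    def on_recognized(evt):
        # Fired once per finalized phrase; skip empty results.
        if evt.result.text:
            phrases.append(evt.result.text)

    def on_finished(evt):
        # Either signal means no more audio will be processed.
        done.set()

    recognizer.recognized.connect(on_recognized)
    recognizer.session_stopped.connect(on_finished)
    recognizer.canceled.connect(on_finished)

    recognizer.start_continuous_recognition()
    try:
        # Block until the session ends or the optional timeout (seconds) expires.
        done.wait(timeout)
    finally:
        recognizer.stop_continuous_recognition()
    return " ".join(phrases)
```

Swapping this in for the recognize_once() call in transcribe_audio is one way to cover a whole recording rather than only its opening phrase.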
Conclusion
By leveraging Azure Cognitive Services, this automated speech-to-text transcription tool provides accurate, efficient, and scalable solutions for processing audio and video files. Whether you're handling podcasts, interviews, or business meetings, this approach saves time and ensures high-quality transcriptions.
For complete source code, visit: GitHub Repository

