
Automating Speech-to-Text: How to Transcribe Audio & Video with Azure Speech Services 

04 Jan 2025
5 min read

Introduction 

Businesses and content creators rely on speech-to-text technology to transcribe audio and video files. Whether you're a developer, researcher, or media professional, automating transcription saves time on work that would otherwise be done by hand. 

This guide walks you through building an automated transcription tool with the Azure Cognitive Services Speech-to-Text API on Ubuntu Linux. By the end of this article, you'll be able to: 

  • Convert video files to audio for transcription 
  • Normalize audio formats for better accuracy 
  • Leverage Azure Speech-to-Text API for precise transcriptions 
  • Automate the transcription process using Python on Ubuntu 
  • Optionally, run this workflow on an Azure Virtual Machine (VM) 

Why Automate Speech-to-Text Transcription? 

Manual transcription is time-consuming and prone to errors. Automating the process speeds up the conversion of multimedia content to text, although accuracy still depends on the quality of the audio. Azure Speech Services provides AI-powered transcription, which makes it a common choice for businesses, podcasters, and professionals. 


Prerequisites 

Before setting up the transcription tool, ensure you have: 

  • A Microsoft Azure account with Speech Services enabled 
  • Python 3 installed on Ubuntu 
  • FFmpeg for media file conversion 
  • Required Python libraries: azure-cognitiveservices-speech and, optionally, moviepy (the snippets below call FFmpeg directly); argparse is part of the Python standard library and needs no installation 

Run the following commands to install dependencies: 

sudo apt update && sudo apt install ffmpeg -y
pip install azure-cognitiveservices-speech moviepy
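Before moving on, it can help to confirm that the prerequisites are actually visible to Python. The following is a minimal sketch, not part of the original tool: the function name `missing_prerequisites` and its default arguments are illustrative assumptions. It only checks that the `ffmpeg` executable is on the PATH and that the Speech SDK module can be located.

```python
import importlib.util
import shutil


def missing_prerequisites(
    executables=("ffmpeg",),
    modules=("azure.cognitiveservices.speech",),
):
    """Return a list describing every prerequisite that cannot be found."""
    missing = []
    for exe in executables:
        # shutil.which returns None when the program is not on the PATH.
        if shutil.which(exe) is None:
            missing.append(f"executable: {exe}")
    for mod in modules:
        try:
            found = importlib.util.find_spec(mod) is not None
        except ImportError:
            # Raised when a parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(f"module: {mod}")
    return missing
```

Calling `missing_prerequisites()` with no arguments returns an empty list when both FFmpeg and the Speech SDK are installed.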

Step 1: Setting Up Azure Speech Services 

  1. Create an Azure Account: Sign up at Azure Portal if you don’t have an account. 
  2. Set Up Speech Services: Navigate to Azure Speech Services, create a resource, select a pricing tier, and copy the API Key and Region from the Keys and Endpoint tab. 
  3. Configure the Speech SDK in Python: 
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_AZURE_SPEECH_KEY",
    region="YOUR_AZURE_REGION",
)
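Hardcoding the key in the script makes it easy to leak through version control. One possible alternative, sketched here under assumptions, is to read the key and region from environment variables. The variable names `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` and the helper `load_speech_settings` are illustrative choices, not names required by the SDK.

```python
import os


def load_speech_settings(env=None):
    """Return (key, region) read from environment variables.

    The variable names below are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("AZURE_SPEECH_KEY", "").strip()
    region = env.get("AZURE_SPEECH_REGION", "").strip()
    # Collect the names of any variables that are unset or blank.
    missing = [
        name
        for name, value in (
            ("AZURE_SPEECH_KEY", key),
            ("AZURE_SPEECH_REGION", region),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return key, region
```

The returned pair can then be passed to the constructor shown above, for example `speechsdk.SpeechConfig(subscription=key, region=region)`.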

Step 2: Writing the Python Script 

Handling Command-Line Arguments 

import argparse

parser = argparse.ArgumentParser(description="Transcribe speech from video and audio files.")
parser.add_argument("media_files", nargs="+", help="Paths to video/audio files")
args = parser.parse_args()

Extract Audio from Video Files 

import subprocess

def extract_audio(video_file):
    # Write the audio track as 16 kHz, mono, 16-bit PCM WAV next to the source file.
    audio_file = f"{video_file.rsplit('.', 1)[0]}_audio.wav"
    subprocess.run([
        "ffmpeg", "-y", "-i", video_file,
        "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_file
    ], check=True)
    return audio_file

Convert Audio to the Required Format 

def convert_audio_to_wav(input_audio):
    # Normalize any audio file to 16 kHz, mono, 16-bit PCM WAV.
    output_wav = input_audio.rsplit('.', 1)[0] + "_fixed.wav"
    subprocess.run([
        "ffmpeg", "-y", "-i", input_audio,
        "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", output_wav
    ], check=True)
    return output_wav

Transcribe Audio Using Azure Speech-to-Text 

def transcribe_audio(audio_file, speech_config):
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    # recognize_once() stops at the first utterance (the first pause or a short maximum duration),
    # so for longer recordings it returns only the opening phrase.
    result = speech_recognizer.recognize_once()
    return result.text if result.reason == speechsdk.ResultReason.RecognizedSpeech else None
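For recordings longer than a single utterance, the Speech SDK also offers continuous recognition, which raises an event for each recognized phrase. The following is a hedged sketch of that approach, not the article's original code: `TranscriptCollector` and `transcribe_long_audio` are hypothetical names introduced for illustration. The SDK import sits inside the function so the collector can be exercised without the SDK installed.

```python
import threading


class TranscriptCollector:
    """Accumulates phrases delivered by continuous-recognition events."""

    def __init__(self):
        self.phrases = []
        self.done = threading.Event()

    def on_recognized(self, evt):
        # Each 'recognized' event carries one finished phrase; skip empty results.
        text = evt.result.text
        if text:
            self.phrases.append(text)

    def on_stopped(self, evt):
        # Fired when the session ends or recognition is canceled.
        self.done.set()

    def text(self):
        return " ".join(self.phrases)


def transcribe_long_audio(audio_file, speech_config):
    # Imported lazily so the collector above has no SDK dependency.
    import azure.cognitiveservices.speech as speechsdk

    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )
    collector = TranscriptCollector()
    recognizer.recognized.connect(collector.on_recognized)
    recognizer.session_stopped.connect(collector.on_stopped)
    recognizer.canceled.connect(collector.on_stopped)

    recognizer.start_continuous_recognition()
    collector.done.wait()  # Block until the end of the file or an error.
    recognizer.stop_continuous_recognition()
    return collector.text()
```

This function takes the same arguments as `transcribe_audio`, so it could be swapped in for long files. It returns an empty string rather than `None` when nothing is recognized.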

Save the Transcription 

import os

def save_transcription(text, filename):
    os.makedirs("transcriptions", exist_ok=True)
    # Write as UTF-8 so non-ASCII characters in the transcript are preserved.
    with open(f"transcriptions/{filename}_transcription.txt", "w", encoding="utf-8") as file:
        file.write(text)

Step 3: Running the Script 

To transcribe an audio or video file, run: 

python transcribe.py video1.mp4 audio1.wav

This script will: 

  1. Extract audio from video (if applicable) 
  2. Convert the audio to the required format 
  3. Send it to Azure’s Speech-to-Text API 
  4. Save the transcribed text in the transcriptions/ folder 
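The four steps above could be wired together roughly as follows. This is a sketch under assumptions rather than the article's actual `transcribe.py`: `VIDEO_EXTENSIONS`, `is_video`, and `process_file` are hypothetical names, and the extension list is only an example. The step functions are passed in as arguments so the flow can be tested without FFmpeg or Azure.

```python
from pathlib import Path

# Assumed set of extensions treated as video; adjust to your own inputs.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi", ".webm"}


def is_video(path):
    """Guess from the file extension whether audio must be extracted first."""
    return Path(path).suffix.lower() in VIDEO_EXTENSIONS


def process_file(path, extract_audio, convert_audio_to_wav, transcribe, save):
    """Run one media file through the four steps; return the transcript or None."""
    audio = extract_audio(path) if is_video(path) else path
    wav = convert_audio_to_wav(audio)
    text = transcribe(wav)
    if text:
        # Skip saving when nothing was recognized.
        save(text, Path(path).stem)
    return text


# Possible use inside transcribe.py with the functions defined earlier:
# for media in args.media_files:
#     process_file(
#         media,
#         extract_audio,
#         convert_audio_to_wav,
#         lambda wav: transcribe_audio(wav, speech_config),
#         save_transcription,
#     )
```

Because `extract_audio` already writes 16 kHz mono PCM, the conversion step is redundant for video inputs. It is kept here only to mirror the listed steps.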

Advanced Features & Future Enhancements 

This workflow can be expanded to support: 

  • Live speech transcription for real-time applications 
  • Multi-speaker recognition for differentiating voices 
  • Automatic translation for multilingual content 


Conclusion

Built on Azure Cognitive Services, this automated speech-to-text tool offers an efficient, scalable way to process audio and video files. Whether you're handling podcasts, interviews, or business meetings, it saves the time that manual transcription would take. 

For complete source code, visit: GitHub Repository 

Dheeraj Kumar
Technical Project Manager

Tech Lead with 8+ years of experience in software development, project management, and UI/UX design, specialising in building scalable mobile applications, leading cross-functional teams, and delivering user-centric solutions with a strong focus on performance, quality, and innovation.
