
You want to add audio-to-text to your product, and the first idea that surfaces in every planning thread is the same: run OpenAI Whisper on your own infrastructure. In practice, most teams that go this route only deploy the pre-trained model for inference; few ever train or fine-tune it. For a small number of teams, that pays off. For most solo devs and startups, it turns a one-afternoon feature into a multi-week infrastructure project.
This guide shows the faster path. You can build a transcriber by calling a hosted transcription API, which gives you audio-to-text, speaker labels, and 100+ languages from a handful of HTTP requests instead of a model you have to host and scale yourself. The examples use Transkriptor's REST API, though the same pattern carries over to any modern speech-to-text service.
Why Call a Transcription API Instead of Self-Hosting Whisper?
The short answer is time and total cost. Whisper is a strong open-source model, and the weights are free, so the model is rarely the expensive part. The infrastructure around it is what costs you.
Run Whisper yourself, and you sign up for a GPU bill that starts around $150 to $400 a month for modest cloud instances and climbs fast under real load. You also own scaling, latency tuning, retries, and model updates. The large-v3 weights alone take about 3 GB in FP16 (roughly 1.55 billion parameters at 2 bytes each), and practical inference needs considerably more once activations and decoding buffers are counted; OpenAI's README lists about 10 GB of VRAM for the large models. Batching support is also limited in the original reference implementation, so real throughput often depends on external optimization frameworks.
There is a bigger gap that surprises most teams. Whisper transcribes the words, but identifying who spoke is a separate problem. Speaker diarization runs as its own model, usually pyannote-audio wired in through WhisperX, which means a second set of weights, gated Hugging Face terms to accept, and two models competing for the same GPU memory. Teams that maintain open-source diarization pipelines built on WhisperX and pyannote-audio report substantial ongoing engineering effort, though published estimates of that effort vary.
Even OpenAI has moved on, releasing gpt-4o-transcribe and now steering new API users toward its newer speech models over the original Whisper.
A hosted API folds all of that into one REST call. You send audio, you get back text with speaker labels and timestamps across 100+ languages, and someone else owns the GPUs. You could reach for OpenAI's gpt-4o-transcribe, Deepgram, or AssemblyAI for the same reason. The examples below use Transkriptor because it returns speaker-labeled transcripts and handles files, URLs, and live meetings through one consistent interface.
How Do You Wire Up the Transcription API?
Uploading a local file takes three steps: ask for a secure upload URL, push the file to that URL, then initiate the transcription job. The upload goes to a temporary URL rather than the API server itself, which keeps large media files off your request path and scales better. Here is the complete flow for wiring up the Transkriptor API in Python:
import json

import requests

# Step 1: Obtain the Upload URL
url = "https://api.tor.app/developer/transcription/local_file/get_upload_url"

# Replace with your actual API key
api_key = "your_api_key"

# Set up the headers, including the API key
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
    "Accept": "application/json",
}

# Request body with the file name
body = json.dumps({"file_name": "your_file_name"})

# Request to get the upload URL
response = requests.post(url, headers=headers, data=body)
if response.status_code == 200:
    response_json = response.json()
    upload_url = response_json["upload_url"]
    # URL to pass in the initiate transcription step
    public_url = response_json["public_url"]
    print("Upload URL obtained:", upload_url)
    print("Public URL obtained:", public_url)
else:
    print("Failed to get upload URL:", response.status_code, response.text)
    exit()

# Step 2: Upload the Local File
file_path = "path/to/your/file.mp3"  # Replace with your actual file path
with open(file_path, "rb") as file_data:
    upload_response = requests.put(upload_url, data=file_data)

if upload_response.status_code == 200:
    print("File uploaded successfully")
else:
    print("File upload failed:", upload_response.status_code, upload_response.text)
    exit()

# Step 3: Initiate Transcription for the Uploaded File
initiate_url = (
    "https://api.tor.app/developer/transcription/local_file/initiate_transcription"
)

# Configure transcription parameters
config = json.dumps(
    {
        "url": public_url,  # Passing public_url to initiate transcription
        "language": "en-US",
        "service": "Standard",
        # "folder_id": "your_folder_id",  # Optional folder_id
        # "triggering_word": "example",  # Optional triggering_word
    }
)

# Send request to initiate transcription
transcription_response = requests.post(initiate_url, headers=headers, data=config)
if transcription_response.status_code == 202:
    transcription_json = transcription_response.json()
    print(transcription_json["message"])
    print("Order ID:", transcription_json["order_id"])
else:
    print(
        "Failed to initiate transcription:",
        transcription_response.status_code,
        transcription_response.text,
    )
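The script above ends once the job is accepted (HTTP 202), so a real integration still has to wait for the finished transcript. This article does not show the endpoint that returns the result, so the sketch below keeps that part abstract: `fetch_status` is a hypothetical callable you would write against whatever result endpoint the developer docs describe, and the status values `"completed"` and `"failed"` are assumptions, not documented API values. It only illustrates polling with capped exponential backoff.

```python
import time
from typing import Callable, Dict


def wait_for_transcript(
    fetch_status: Callable[[], Dict],
    max_attempts: int = 10,
    initial_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reports completion, backing off between attempts.

    fetch_status is a hypothetical function returning a dict with a
    "status" key; the values "completed" and "failed" are assumed here.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(f"Transcription failed: {result}")
        if attempt < max_attempts:
            # Wait, then double the delay up to the configured ceiling
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise TimeoutError(f"Job not finished after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. If the service offers webhooks, a callback is usually a better fit than polling for long recordings.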
How Do You Handle Multiple Languages and Speaker Labels?
Two features separate a basic transcriber from one people rely on: language coverage and knowing who said what.
Languages are a single parameter. Pass a locale code such as en-US, es-ES, or ja-JP (an ISO 639-1 language code plus an ISO 3166-1 region) in the language field when you start the job, and the service handles the rest across 100+ languages. Swap the code and the same pipeline transcribes Spanish, Japanese, or Hindi with no extra setup. For mixed or unknown audio, you can let the service detect the language automatically instead of hard-coding it.
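Because the language is just one field in the request body, switching languages can be reduced to a small payload builder. The sketch below reuses the `url`, `language`, and `service` fields from the script earlier in this guide; the helper name `build_transcription_config` and the format check are illustrative assumptions, and the check only mirrors the shape of the example codes rather than any list the API publishes.

```python
import json
import re
from typing import Optional

# Assumed shape, based only on the examples en-US, es-ES, ja-JP
LANGUAGE_PATTERN = re.compile(r"[a-z]{2,3}-[A-Z]{2}")


def build_transcription_config(
    public_url: str,
    language: str = "en-US",
    service: str = "Standard",
    folder_id: Optional[str] = None,
) -> str:
    """Return the JSON body for the initiate-transcription request."""
    if not LANGUAGE_PATTERN.fullmatch(language):
        raise ValueError(f"Unexpected language code: {language!r}")
    payload = {"url": public_url, "language": language, "service": service}
    if folder_id is not None:
        payload["folder_id"] = folder_id
    return json.dumps(payload)


# One payload per language; the rest of the pipeline stays unchanged
configs = {
    code: build_transcription_config("https://example.com/audio.mp3", code)
    for code in ("en-US", "es-ES", "ja-JP")
}
```

Automatic language detection is left out on purpose, since the exact parameter for it is not shown here and should be taken from the developer docs.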
Speaker labels are where the hosted route saves the most work. Self-hosted Whisper needs an entire second model for this, while the Transkriptor API includes speaker segmentation in the result you fetch once the job completes. The response breaks the transcript into segments, each tagged with a speaker identifier and a timestamp, so you can render a fully attributed transcript directly in your UI.
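As a rough illustration of rendering those segments, the sketch below turns a list of segments into an attributed transcript and merges consecutive turns from the same speaker. The field names `speaker`, `start_time` (in seconds), and `text` are assumptions made for this example; check the actual response schema in the developer docs and adjust the keys accordingly.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format a second offset as HH:MM:SS, dropping fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: List[Dict]) -> str:
    """Build one line per speaker turn from hypothetical segment dicts."""
    lines: List[str] = []
    previous_speaker = None
    for segment in segments:
        speaker = segment["speaker"]
        text = segment["text"].strip()
        if speaker == previous_speaker and lines:
            # Same speaker kept talking: extend the current turn
            lines[-1] += " " + text
        else:
            stamp = format_timestamp(segment["start_time"])
            lines.append(f"[{stamp}] {speaker}: {text}")
        previous_speaker = speaker
    return "\n".join(lines)
```

Merging adjacent segments keeps the UI readable when the service splits one speaker's turn into several short segments.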
If you need a downloadable file, the Export Transcription endpoint lets you choose your format: TXT, SRT, PDF, or DOCX. It gives you control over whether to include speaker names and timestamps in the output. For recordings with overlapping voices, Transkriptor also exposes a dedicated speaker recognition endpoint. The custom vocabulary feature is worth setting up early if your audio contains product names, medical terms, or acronyms, since a generic model tends to mangle exactly the words your users search for most.
When Does Training or Self-Hosting Your Own Model Actually Make Sense?
Training or self-hosting a model only makes sense in a few specific situations where API usage is no longer practical. In all other cases, using an API is still the default and most efficient option.
- Strict data rules: If audio cannot leave your infrastructure for legal or compliance reasons, self-hosted Whisper keeps everything in-house.
- Heavy, steady volume: Per-minute pricing stays cheap until it does not. Above a few hundred hours a month, a reserved GPU can undercut metered calls.
- Offline or edge needs: Apps that transcribe on-device or in air-gapped environments cannot lean on a cloud call.
- A real ML team: If you already run GPU infrastructure and have engineers who enjoy this work, the operational cost is one you have already absorbed.
For everyone else, especially a solo dev shipping a transcriber feature this sprint, the math favors the API. You trade a per-minute fee for weeks of saved engineering and a feature that works today.
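The volume argument can be made concrete with a small break-even calculation. The sketch below uses the $150 to $400 monthly GPU range quoted earlier; the per-minute API price is a placeholder chosen only for illustration, not a quoted rate from any provider, and the model ignores engineering time, which usually pushes the real break-even higher.

```python
def break_even_hours(gpu_monthly_cost: float, api_price_per_minute: float) -> float:
    """Hours of audio per month at which a flat GPU bill equals metered API cost."""
    if api_price_per_minute <= 0:
        raise ValueError("api_price_per_minute must be positive")
    api_price_per_hour = api_price_per_minute * 60
    return gpu_monthly_cost / api_price_per_hour


# Placeholder price for illustration only; substitute your provider's real rate
ASSUMED_API_PRICE_PER_MINUTE = 0.01

for gpu_cost in (150.0, 400.0):
    hours = break_even_hours(gpu_cost, ASSUMED_API_PRICE_PER_MINUTE)
    print(f"GPU at ${gpu_cost:.0f}/month breaks even near {hours:.0f} hours of audio")
```

With that placeholder price, the break-even falls between roughly 250 and 670 hours a month, which is consistent with the "few hundred hours" threshold above. Below it, the metered API is cheaper even before counting the operations work.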
Conclusion
Adding audio-to-text used to mean a research project. Today, a hosted transcriber turns it into an afternoon of wiring up REST calls, and you walk away with speaker labels and 100+ languages without renting a single GPU. Start with the API, ship the feature, and revisit self-hosting only when data rules or volume genuinely demand it.
If you want to try the flow above, Transkriptor's developer docs cover the upload, meeting, webhook, and export endpoints with copy-paste snippets in Python, JavaScript, Go, and more, so you can drop the calls straight into whatever stack you already run.