Transcribe_Subs

Generate .srt subtitles for videos that have no subtitles, using local AI speech-to-text and LLM-powered cleanup.

Whisper runs locally on your NVIDIA GPU — no audio leaves your machine. Only the transcribed text is sent to an LLM API for punctuation and cleanup, at a cost of roughly $0.06 per movie.

Features

Fully automatic — point at a folder, get subtitles for every video without subs
Local speech-to-text — audio never leaves your machine (Whisper on your GPU)
AI-polished output — LLM fixes punctuation, misheard words, and filler
Cheap — ~$0.06 per movie via DeepSeek API (recommended; other providers also supported)
Smart filtering — skips videos that already have subtitles (embedded or sidecar)
Cross-platform — Windows and Linux (Debian/Ubuntu) supported
Netflix-style formatting — 42 chars/line, 2-line max, proper timing and reading speed

How It Works

Video file
  │
  ▼
[1] Whisper (local GPU) ─── speech to text with word-level timestamps
  │
  ▼
[2] LLM Punctuation ─── add periods, commas, capitals to raw transcript
  │
  ▼
[3] Re-segmentation ─── rebuild entries at sentence boundaries (code)
  │
  ▼
[4] LLM Cleanup ─── fix misheard words, remove filler
  │
  ▼
[5] Post-processing ─── timing rules, merge, line wrap, validate (code)
  │
  ▼
Polished .srt file

What it does:

Scans a folder recursively for video files (MKV, MP4, AVI, MOV, WebM)
Skips any video that already has subtitles (embedded or sidecar .srt)
Transcribes speech using faster-whisper with the large-v3 model
Fixes punctuation and misheard words via LLM (DeepSeek Chat or any OpenAI-compatible API)
Formats output following professional subtitle standards (42 chars/line, 2-line max, readable pacing)

Cost

Step	Cost
Whisper transcription	Free (runs on your GPU)
LLM processing — 45-min episode	~$0.03
LLM processing — 2-hr movie	~$0.06
LLM processing — 10-episode season	~$0.30

Note

These estimates are based on DeepSeek Chat pricing as of April 2026 (the default and recommended provider). Pricing may change over time. Costs will also vary if you use a different provider or model — OpenAI, Anthropic Claude, and other premium models can cost significantly more. DeepSeek Chat offers the best cost/quality balance for this task. Always check your provider's current pricing before processing large libraries.

The --skip-llm flag exists for testing but is not recommended — raw Whisper output lacks proper punctuation and sentence boundaries, making it difficult to read as subtitles. The LLM passes are essential for usable output.

Installation

Choose your platform:

Requirements (both platforms)

Component	Required	Notes
NVIDIA GPU	Recommended	CPU works but is 10-20x slower
Python 3.11+	Yes	Installed by the script if missing
Git	Yes	Installed by the script if missing
FFmpeg	Yes	Installed by the script if missing
LLM API key	Yes	DeepSeek recommended (pipeline is tested with it). Other OpenAI-compatible providers also work

Both installers automatically detect and install missing dependencies.

Windows Install

Prerequisites

Windows 10 or 11
PowerShell 5.1+ (built into Windows)
Anaconda or Miniconda (the installer can install Miniconda for you)
NVIDIA drivers (the installer can install them for you)

Step 1: Run the installer

Open PowerShell, navigate to where you want to install, and run the one-liner:

cd D:\
irm https://raw.githubusercontent.com/dexusno/Transcribe-Subs/main/install.ps1 | iex

This creates a Transcribe_Subs folder in your current directory (e.g. D:\Transcribe_Subs).

The installer will:

Check for NVIDIA GPU, drivers, and CUDA
Check for conda, git, ffmpeg — offer to install via winget if missing
Clone the repository
Create an isolated transcribe_subs conda environment with Python 3.11
Install all Python dependencies
Verify CUDA works end-to-end — automatically fix missing runtime libraries
Pre-download the Whisper large-v3 model (~3 GB, one-time download)
Create config files from templates

If the installer installs NVIDIA drivers, you will need to restart your computer and run the installer again.

Step 2: Get your LLM API key

DeepSeek is recommended (see Configuration below), but any OpenAI-compatible provider works.

Step 3: Run it

cd D:\Transcribe_Subs
.\transcribe_subs.ps1 "D:\Movies\Some Movie"

The wrapper automatically activates the conda environment.

Linux Install

Prerequisites

Debian 13, Ubuntu 22.04+, or similar
sudo access (for installing system packages)
NVIDIA drivers: sudo apt install nvidia-driver (and reboot afterwards)

Step 1: Run the installer

Open a terminal, navigate to where you want to install, and run the one-liner:

cd /opt
curl -fsSL https://raw.githubusercontent.com/dexusno/Transcribe-Subs/main/linux/install.sh | bash

This creates a Transcribe_Subs folder in your current directory (e.g. /opt/Transcribe_Subs).

The installer will:

Check for NVIDIA GPU and drivers
apt-get install system dependencies (python3, python3-venv, git, ffmpeg)
Clone the repository
Create a Python venv at .venv/ inside the project
Install all Python dependencies
Verify CUDA works end-to-end — automatically fix missing runtime libraries
Pre-download the Whisper large-v3 model (~3 GB, one-time download)
Create config files from templates

If you need to install NVIDIA drivers first: sudo apt install nvidia-driver && sudo reboot

Step 2: Get your LLM API key

DeepSeek is recommended (see Configuration below), but any OpenAI-compatible provider works.

Step 3: Run it

cd /opt/Transcribe_Subs
./linux/transcribe_subs.sh "/media/movies/Some Movie"

The wrapper runs the venv Python directly — no manual activation needed.

Configuration

Get an LLM API key

The LLM is a required part of the pipeline — it adds punctuation to the raw transcript and fixes speech recognition errors. Without it, the output lacks sentence boundaries and is not usable as subtitles.

DeepSeek is recommended — the pipeline is developed and tested with DeepSeek Chat. It offers the best balance of quality, speed, and cost (~$0.03 per episode). To set up DeepSeek:

Go to platform.deepseek.com/api_keys
Create an account (or sign in)
Add some credit ($2-5 is enough for hundreds of episodes)
Generate a new API key
Copy the key (it starts with sk-)

Other providers also work — OpenAI, Groq, OpenRouter, and any OpenAI-compatible API are supported. See LLM Profiles for how to configure them. Results may vary as the prompts are optimised for DeepSeek, and premium providers can cost 10-50x more.

Add your key to .env

Open the .env file in the project directory and replace your-key-here:

DEEPSEEK_API_KEY=sk-abc123your-actual-key-here

Save the file. That's all the configuration needed to start using the tool.

Security: The .env file is in .gitignore and will never be committed to git. Your API key stays local.

Usage

Windows

# Basic — generate subtitles for all videos without subs
.\transcribe_subs.ps1 "D:\Movies\Inception (2010)"

# Preview — see what would be processed
.\transcribe_subs.ps1 "D:\Movies" -DryRun

# Force language — skip auto-detection
.\transcribe_subs.ps1 "D:\TvSeries\Breaking Bad" -Language en

# Smaller model — faster but less accurate (good for testing)
.\transcribe_subs.ps1 "D:\Movies" -WhisperModel medium

# Limit files
.\transcribe_subs.ps1 "D:\Movies" -Limit 5

# Force re-transcribe (overwrite existing .srt)
.\transcribe_subs.ps1 "D:\Movies" -Force

# Use a different LLM
.\transcribe_subs.ps1 "D:\Movies" -Profile openai

# Network share (UNC paths supported)
.\transcribe_subs.ps1 "\\NAS\Media\Movies"

Linux

# Basic — generate subtitles for all videos without subs
./linux/transcribe_subs.sh "/media/movies/Inception (2010)"

# Preview — see what would be processed
./linux/transcribe_subs.sh "/media/movies" --dry-run

# Force language — skip auto-detection
./linux/transcribe_subs.sh "/media/tv/Breaking Bad" --language en

# Smaller model — faster but less accurate
./linux/transcribe_subs.sh "/media/movies" --whisper-model medium

# Limit files
./linux/transcribe_subs.sh "/media/movies" --limit 5

# Force re-transcribe (overwrite existing .srt)
./linux/transcribe_subs.sh "/media/movies" --force

# Use a different LLM
./linux/transcribe_subs.sh "/media/movies" --profile openai

# Network share
./linux/transcribe_subs.sh "/mnt/nas/movies"

CLI Options

Windows flag	Linux flag	Description	Default
`folder`	`folder`	Path to scan for video files	Required
`-Profile`	`--profile`	LLM profile from llm_config.json	`deepseek`
`-BatchSize`	`--batch-size`	Subtitle entries per LLM API call	150
`-Parallel`	`--parallel`	Concurrent file processing threads	4
`-WhisperModel`	`--whisper-model`	Override Whisper model size	`large-v3`
`-Language`	`--language`	Language code (e.g. `en`, `es`, `fr`)	`en`
`-Limit`	`--limit`	Max number of files to process	unlimited
`-Force`	`--force`	Re-transcribe even if `.srt` exists	off
`-DryRun`	`--dry-run`	Show what would be processed	off
`-SkipLLM`	`--skip-llm`	Output raw Whisper without LLM cleanup	off
`-KeepWhisper`	`--keep-whisper`	Keep the raw `.whisper` cache file after processing	off
`-LogFile`	`--log-file`	Also write log output to this file	none

Output

For each video without subtitles, the tool creates a language-tagged .srt file next to it:

Movies/Inception (2010)/
    Inception.mkv
    Inception.en.srt       ← generated (language auto-detected or from -Language flag)

The language code (e.g. .en, .fr, .de) is detected automatically by Whisper or set manually with -Language. Media servers like Plex, Jellyfin, and Kodi recognise this naming convention and display the correct language label.

The .srt file:

Uses UTF-8 encoding with BOM for maximum player compatibility
Follows SRT format specification
Works in VLC, MPC-HC, Plex, Jellyfin, Kodi, and all standard media players

Advanced Configuration

All settings live in llm_config.json in the project directory.

Whisper Settings

"whisper": {
    "model": "large-v3",
    "device": "cuda",
    "compute_type": "int8",
    "language": "en",
    "beam_size": 10,
    "best_of": 5,
    "patience": 2.0,
    "vad_filter": false,
    "word_timestamps": true,
    "condition_on_previous_text": false
}

Setting	Description	Options
`model`	Whisper model size	`tiny`, `base`, `small`, `medium`, `large-v2`, `large-v3`
`device`	Run on GPU or CPU	`cuda`, `cpu`
`compute_type`	Precision (lower = faster, less VRAM)	`int8`, `float16`, `float32`
`language`	Language for transcription	`"en"` (default), `"es"`, `"fr"`, `null` (auto-detect)
`beam_size`	Beam search width (higher = more accurate)	`1`-`15`
`best_of`	Candidates per segment (picks the best)	`1`-`5`
`patience`	Beam search patience	`1.0`-`3.0`
`vad_filter`	Voice Activity Detection filter	`false` (default), `true`
`word_timestamps`	Word-level timing for precise segmentation	`true` (required)
`condition_on_previous_text`	Feed previous text as context to next chunk	`false` (default)

Subtitle Rules

"subtitle_rules": {
    "max_chars_per_line": 42,
    "max_lines": 2,
    "target_cps": 17,
    "max_cps": 20,
    "min_duration_ms": 1000,
    "max_duration_ms": 7000,
    "min_gap_ms": 83
}

Warning

These defaults are tuned to work with the current pipeline logic. Changing values like max_lines or max_chars_per_line may require changes to the code — the line wrapping, entry merging, and sentence splitting logic are built around these specific values. Only modify these if you understand the impact on the processing pipeline.

Rule	Default	What it does	How it's enforced
`max_chars_per_line`	42	Preferred max characters per line	Soft limit — lines may slightly exceed to avoid dropping words
`max_lines`	2	Maximum lines per subtitle entry	Hard limit
`target_cps`	17	Target reading speed (chars/sec)	Used for character budget calculations
`max_cps`	20	Maximum reading speed	Logged as warning, not hard-enforced
`min_duration_ms`	1000	Shortest a subtitle can display (ms)	Extended to meet minimum when possible
`max_duration_ms`	7000	Longest a subtitle can display (ms)	Long entries split in pre-processing
`min_gap_ms`	83	Minimum gap between subtitles (ms)	Hard-enforced (83ms = 2 frames at 24fps)

LLM Profiles

The profiles section in llm_config.json defines LLM backends. You can use any OpenAI-compatible API.

Default profile (DeepSeek Chat):

"deepseek": {
    "api_url": "https://api.deepseek.com/v1/chat/completions",
    "model": "deepseek-chat",
    "api_key_env": "DEEPSEEK_API_KEY",
    "batch_size": 150,
    "parallel": 4,
    "timeout": 120
}

Included profiles: deepseek (default, recommended), openai, groq, openrouter, local (Ollama/LM Studio)

To use a different cloud provider, add its API key to .env and pass the profile flag:

# Windows
.\transcribe_subs.ps1 "D:\Movies" -Profile openai

# Linux
./linux/transcribe_subs.sh "/media/movies" --profile openai

Caution

Local LLMs (Ollama, LM Studio) require very capable hardware and large models (14B+ parameters) to produce acceptable results. Small local models (7B and below) will struggle with the punctuation and error correction tasks and may produce poor quality subtitles. The cloud APIs are strongly recommended unless you have high-end hardware (48GB+ VRAM) and can run models like Qwen 2.5 72B or Llama 3 70B.

Under the Hood

Click to expand — explains the engineering decisions behind the pipeline

Why 5 Passes?

Early versions used a simpler pipeline (Whisper → single LLM pass → done) but produced poor results: sentence bleeding, dropped words, garbled punctuation. Each problem required a focused solution, leading to the current 5-pass architecture where each step does one thing well.

Pass 1: Whisper Transcription

Model: large-v3 with INT8 quantization — best accuracy while using only ~3-4 GB VRAM.

Word-level timestamps: Instead of trusting Whisper's segment boundaries (which often dump 6-25 seconds of multi-speaker dialogue into a single block), we extract per-word timestamps and rebuild subtitle entries ourselves.

condition_on_previous_text=false: The default (true) feeds previous output as context to the next chunk, which gives better punctuation consistency but causes catastrophic cascading errors — if Whisper mishears a place name once, it repeats the error for the entire file. We tested both extensively. With false, each 30-second chunk is independent.

VAD disabled: Silero VAD was filtering out 30 minutes of a 57-minute episode — removing quiet dialogue along with silence. Whisper's own silence handling plus our hallucination detection in post-processing is sufficient.

Quality settings (beam_size=10, best_of=5, patience=2.0): We prioritise accuracy over speed. This is a one-shot-per-file tool for content with no subtitles at all, so spending extra time is acceptable.

Pass 2: LLM Punctuation

The problem: With condition_on_previous_text=false, Whisper produces ~27% of entries with no punctuation at all — entire passages of lowercase text with no periods, commas, or capitals.

Why not a simple approach? We tried several:

Having the LLM punctuate per-entry with [N] indexing — it added false periods at entry boundaries because it couldn't see that sentences continued in the next entry
Sending continuous text and remapping by word count — fragile, any word change breaks alignment
NLP libraries (spaCy, PySBD) — adds dependencies and doesn't work well on unpunctuated text

The solution: Overlapping windows with [N] indexing. Each batch includes 20 entries of context overlap from the previous batch. The LLM sees flowing dialogue across entries and can detect where sentences truly end. Only the non-overlapping portion of each response is used, giving us perfect [N] mapping back to entries.

Strict prompt: The LLM is instructed with emphatic language ("You MUST read ALL lines THOROUGHLY", "Do NOT place a period just because a line ends", "Read AHEAD to find where each sentence actually ends"). Testing showed this produces significantly better results than polite instructions.

ALL CAPS normalisation: Whisper sometimes outputs entire sections in ALL CAPS (dramatic scenes, shouting). Before the LLM sees the text, we convert any all-uppercase entries to lowercase — the LLM adds proper capitalisation back.

Pass 3: Sentence Re-Segmentation

After punctuation, the code rebuilds entries from scratch at sentence boundaries. This is pure code — no LLM involved.

Flatten all entries into a word stream with timestamps
Group words into sentences using punctuation (. ! ?)
Sentences that fit within limits (84 chars, 7 seconds) become one entry
Long sentences are split at clause boundaries with tiered preference:
- Tier 1: comma + conjunction ("..., but") — best
- Tier 2: after any comma
- Tier 3: before a conjunction without comma
- Tier 4: nearest to midpoint — last resort

An abbreviation list (Mr., Mrs., Dr., D.I., etc.) prevents false sentence splits.

The key insight: Build sentences first, then fit them into entries. The previous approach (build entries, hope sentences align) caused constant bleeding.

Pass 4: LLM Cleanup

Now that entries contain clean, properly punctuated sentences, the LLM can focus purely on fixing speech recognition errors:

Misheard words: "lorry ticket" → "lottery ticket", "tandem gloid" → "tandem glider"
Filler words: um, uh, er, like, you know, I mean
Stuttering: "it's it's important" → "it's important"
False starts: "I was— I went there" → "I went there"

The prompt explicitly says: "Do NOT remove, shorten, or rephrase anything else. Keep every word." Earlier versions asked the LLM to also condense text to fit character budgets, but LLMs cannot count characters and were dropping words. The condensation was moved to code.

Pass 5: Post-Processing

All formatting is done by code, not the LLM:

Hallucination detection:

Speaking speed: 3+ words in under 0.5 seconds is physically impossible — removed
Speed limit: over 12 words/second is beyond any human speech — removed
Metadata patterns: credit lines, website names, music symbols, copyright lines
Consecutive duplicate entries

Entry merging: Consecutive short entries (each under 42 chars, under 2.5 seconds, gap < 0.5s) are merged into 2-line entries. This reduces subtitle flickering during rapid dialogue.

Line wrapping: Entries over 42 characters are split into 2 lines using a scoring system:

Balance score (lines should be roughly equal length)
Inverted pyramid bonus (bottom line ≥ top line)
Natural break points (conjunctions, prepositions, after punctuation)
Overflow penalty (soft limit — a 45-char line is penalised but never truncated, because dropping words is worse than a slightly long line)

Timing enforcement: Min 1 second display, max 7 seconds, min 83ms gap between entries, CPS checking.

Whisper Cache (.whisper files)

Raw Whisper output is saved as a .whisper file next to the video. If the LLM pass fails or the user aborts, subsequent runs skip Whisper entirely and go straight to LLM processing. The .whisper extension was chosen because it's not a real subtitle format — no media player or scanner will detect it.

Why Each LLM Pass Does Only One Thing

Through testing, we discovered that LLMs forget complex multi-step instructions after processing the first few entries in a batch. A prompt that says "fix punctuation AND condense to fit budget AND correct misheard words" works on entries 1-5 and degrades from there. Two passes with simple, focused instructions ("add punctuation" then "fix errors") produce dramatically better results across entire batches.

Troubleshooting

"No API key found" Make sure .env exists in the project directory and has your key. See Configuration.

"Failed to extract audio" FFmpeg is not installed or not in PATH.

Windows: winget install ffmpeg then restart your terminal
Linux: sudo apt install ffmpeg

Whisper is very slow You may be running on CPU. Check the log output for device=cuda. Re-run the installer (install.ps1 / install.sh) to diagnose GPU/CUDA issues.

Subtitles have wrong language Whisper auto-detects language. Force it with -Language en (or --language en on Linux). Supports all Whisper language codes: en, es, fr, de, ja, zh, etc.

Place names or character names are wrong This is a Whisper limitation. The speech-to-text engine sometimes mishears proper nouns that are uncommon. The LLM cleanup catches many of these using context but can't get them all.

Linux: nvidia-smi: command not found Install drivers: sudo apt install nvidia-driver and reboot. Then re-run install.sh.

Linux: venv activation fails The wrapper doesn't activate the venv — it runs .venv/bin/python directly. If you want to activate manually: source .venv/bin/activate.

Disclaimer

This software is provided as-is, without warranty of any kind. By using Transcribe_Subs, you acknowledge the following:

Transcription accuracy — subtitles are generated by AI speech recognition (Whisper) and cleaned up by third-party LLM APIs. Output quality depends on audio quality, accents, background noise, and the speech recognition model. Always spot-check subtitles before relying on them for critical use.
API costs — cloud LLM providers charge per token. While costs are low (~$0.03 per episode), processing a very large library will accumulate charges. Use --dry-run to preview what will be processed before committing.
Third-party services — this tool sends transcribed text (not audio) to external APIs (DeepSeek, OpenAI, etc.) for processing. Do not use it on content you are not authorised to share with these services.
File creation — this tool creates .srt and .whisper files alongside your video files. It does not modify or delete your original media files.
Legal responsibility — you are solely responsible for ensuring your use of this tool complies with applicable laws, including copyright and content licensing. The authors of this project are not responsible for how it is used.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
linux		linux
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
install.ps1		install.ps1
llm_config.example.json		llm_config.example.json
requirements.txt		requirements.txt
transcribe_subs.ps1		transcribe_subs.ps1
transcribe_subs.py		transcribe_subs.py

Folders and files

Latest commit

History

Repository files navigation

Transcribe_Subs

Table of Contents

Features

How It Works

Cost

Installation

Requirements (both platforms)

Windows Install

Prerequisites

Step 1: Run the installer

Step 2: Get your LLM API key

Step 3: Run it

Linux Install

Prerequisites

Step 1: Run the installer

Step 2: Get your LLM API key

Step 3: Run it

Configuration

Get an LLM API key

Add your key to .env

Usage

Windows

Linux

CLI Options

Output

Advanced Configuration

Whisper Settings

Subtitle Rules

LLM Profiles

Under the Hood

Why 5 Passes?

Pass 1: Whisper Transcription

Pass 2: LLM Punctuation

Pass 3: Sentence Re-Segmentation

Pass 4: LLM Cleanup

Pass 5: Post-Processing

Whisper Cache (.whisper files)

Why Each LLM Pass Does Only One Thing

Troubleshooting

Disclaimer

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages