Captions
Generate captions from your recording's audio using on-device speech-to-text. Style them in the editor, preview in real time, then burn them into the export or save as SRT/VTT.
On-device transcription
WhisperKit runs Whisper models locally through CoreML. Models are downloaded once and cached. No data leaves your machine.
WhisperKit
Runs OpenAI's Whisper models on-device through Apple's CoreML runtime. No network calls, no API keys. Your audio stays on your machine.
Four Model Sizes
Base (~140 MB), Small (~460 MB), Medium (~1.5 GB), and Large v3 (~3 GB). Downloaded on first use and cached at ~/.reframed/models/.
Apple Silicon Only
Runs on the Neural Engine, GPU, and CPU together. Requires Apple Silicon — the captions tab is hidden on Intel Macs.
Speed vs. Quality
Base transcribes at roughly 30x real time, Large at about 5x. For a 10-minute recording, that's 20 seconds vs. 2 minutes.
Available models
Model      Size      Speed       Quality
Base       ~140 MB   ~30x real   Good for clear speech
Small      ~460 MB   ~15x real   Better with accents
Medium     ~1.5 GB   ~8x real    Noticeably more accurate
Large v3   ~3 GB     ~5x real    Best available
Decoding options
wordTimestamps: true
chunkingStrategy: .vad
temperatureFallbackCount: 2
compressionRatioThreshold: 2.8
noSpeechThreshold: 0.5
computeUnits: .all (CPU + GPU + ANE)
How transcription works
Pick an audio source and a language, then hit generate. Progress updates live as WhisperKit processes 30-second windows through VAD chunking.
Audio Source Selection
Transcribe from the microphone track or system audio. If only one track exists, it's selected automatically.
Language Detection
Auto-detect or pick from 99 languages. Setting the language explicitly skips detection and speeds things up a little.
Word-level Timestamps
Every word gets its own start and end time. This is what makes per-word highlighting and line-breaking work.
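A minimal sketch of the per-word lookup this enables, assuming word timings shaped like the project's CaptionWord (the struct fields and function name here are illustrative, not the shipped code):

```swift
import Foundation

// Illustrative shape mirroring the project's word-level timing data.
struct CaptionWord {
    let text: String
    let startSeconds: Double
    let endSeconds: Double
}

// Returns the word whose [start, end) interval contains the playhead time,
// i.e. the word to highlight at this instant.
func activeWord(in words: [CaptionWord], at time: Double) -> CaptionWord? {
    words.first { time >= $0.startSeconds && time < $0.endSeconds }
}
```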
Short Segment Merging
Fragments with fewer than 4 words get folded into the previous segment when the gap is under 2 seconds. Cuts down on subtitle flicker.
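The merging rule above can be sketched as follows, assuming segments carry their own start/end times and text (names and thresholds are taken from the description, not from the shipped code):

```swift
import Foundation

struct Segment {
    var startSeconds: Double
    var endSeconds: Double
    var text: String
}

// Folds fragments with fewer than `minWords` words into the previous
// segment when the silence gap between them is under `maxGap` seconds.
func mergeShortSegments(_ segments: [Segment],
                        minWords: Int = 4,
                        maxGap: Double = 2.0) -> [Segment] {
    var merged: [Segment] = []
    for segment in segments {
        let wordCount = segment.text.split(separator: " ").count
        if var previous = merged.last,
           wordCount < minWords,
           segment.startSeconds - previous.endSeconds < maxGap {
            // Extend the previous segment instead of flashing a fragment.
            previous.endSeconds = segment.endSeconds
            previous.text += " " + segment.text
            merged[merged.count - 1] = previous
        } else {
            merged.append(segment)
        }
    }
    return merged
}
```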
Caption appearance
All styling is previewed live in the editor. Font size scales proportionally so it matches between the small preview and full-resolution export.
Font Size & Weight
Size from 16 to 96 px, weight from regular to bold. Scaling is relative to a 1920 px reference width, so it looks the same in preview and export.
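The reference-width scaling amounts to a single proportion; a sketch, with the function name assumed for illustration:

```swift
import Foundation

// Scales the user's font size from the 1920 px reference width to the
// actual render width, so a 48 px caption looks the same in the small
// preview and the full-resolution export.
func scaledFontSize(_ fontSize: CGFloat,
                    renderWidth: CGFloat,
                    referenceWidth: CGFloat = 1920) -> CGFloat {
    fontSize * renderWidth / referenceWidth
}
```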
Colors
Text color and background color picked from a Tailwind-style palette. Background opacity is a separate slider so you can soften it without changing the color.
Position
Bottom, top, or center. Positioned relative to the video rect, not the canvas, so padding and aspect ratio changes don't push captions off-screen.
Words per Line
Controls how many words fit on one line before wrapping. Long segments get split into two-line windows that advance as the speaker progresses.
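The wrapping and windowing described above can be sketched like this (a simplified model of the behavior; names are illustrative):

```swift
import Foundation

// Groups a segment's words into lines of at most `maxWordsPerLine`,
// then pairs lines into windows of `linesPerWindow` lines that replace
// each other as the speaker progresses.
func captionWindows(words: [String],
                    maxWordsPerLine: Int,
                    linesPerWindow: Int = 2) -> [[String]] {
    // Break the word list into lines.
    var lines: [String] = []
    var index = 0
    while index < words.count {
        let end = min(index + maxWordsPerLine, words.count)
        lines.append(words[index..<end].joined(separator: " "))
        index = end
    }
    // Group lines into advancing windows.
    var windows: [[String]] = []
    index = 0
    while index < lines.count {
        let end = min(index + linesPerWindow, lines.count)
        windows.append(Array(lines[index..<end]))
        index = end
    }
    return windows
}
```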
Getting captions out
Three options: burn into the video, export as SRT, or export as VTT. Pick one per export.
Burn-in
Captions are rendered into the video frames with CoreText during export. What you see in the preview is what you get in the file.
SRT Export
Standard SubRip format. Saved next to the video with the same filename. Works with YouTube, Vimeo, and most players.
WebVTT Export
VTT format for web players and browsers. Same timing data, different container.
Burn-in rendering pipeline
CameraVideoCompositor.drawCaptions():
1. Find active segment for current frame time
2. Build visible text (word grouping + line wrapping)
3. Measure text with CTFramesetter
4. Draw background pill (rounded rect, user opacity)
5. Draw text with CTFrameDraw
6. Position relative to videoRect based on CaptionPosition
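Step 1 runs once per frame, so the lookup should stay cheap. One way to do it, assuming segments are sorted and non-overlapping, is a binary search over interval bounds (an illustrative sketch, not the shipped code):

```swift
import Foundation

// Locates the segment whose [start, end) interval covers the frame time.
// Binary search keeps the per-frame cost at O(log n) even for long
// transcripts.
func activeSegmentIndex(starts: [Double], ends: [Double],
                        at time: Double) -> Int? {
    var low = 0, high = starts.count - 1
    while low <= high {
        let mid = (low + high) / 2
        if time < starts[mid] {
            high = mid - 1        // frame time is before this segment
        } else if time >= ends[mid] {
            low = mid + 1         // this segment has already ended
        } else {
            return mid            // segment covers the frame time
        }
    }
    return nil                    // no caption visible at this frame
}
```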
SRT output format
1
00:00:01,200 --> 00:00:04,800
First line of the transcription

2
00:00:05,100 --> 00:00:08,300
Second line of the transcription
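A sketch of how seconds map to the SRT timestamp format (SRT uses a comma before the milliseconds; WebVTT uses a period but the timing math is identical). Function names here are assumptions for illustration:

```swift
import Foundation

// Formats a time in seconds as an SRT timestamp: HH:MM:SS,mmm.
func srtTimestamp(_ seconds: Double) -> String {
    let totalMillis = Int((seconds * 1000).rounded())
    let millis = totalMillis % 1000
    let totalSeconds = totalMillis / 1000
    let secs = totalSeconds % 60
    let mins = (totalSeconds / 60) % 60
    let hours = totalSeconds / 3600
    return String(format: "%02d:%02d:%02d,%03d", hours, mins, secs, millis)
}

// Builds one numbered SRT cue block.
func srtCue(index: Int, start: Double, end: Double, text: String) -> String {
    "\(index)\n\(srtTimestamp(start)) --> \(srtTimestamp(end))\n\(text)\n"
}
```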
Segment structure
Segments and their word-level timing are saved in the project file alongside all other editor state. Re-opening a project restores captions without re-transcribing.
CaptionSegment
CaptionSegment {
id: UUID
startSeconds: Double
endSeconds: Double
text: String
words: [CaptionWord]?
}
CaptionSettingsData
CaptionSettingsData {
enabled: Bool
fontSize: CGFloat // 16–96
fontWeight: CaptionFontWeight
textColor: CodableColor
backgroundColor: CodableColor
backgroundOpacity: CGFloat // 0.1–1.0
showBackground: Bool
position: .bottom | .top | .center
maxWordsPerLine: Int // 2–12
model: String // e.g. "openai_whisper-base"
language: CaptionLanguage
audioSource: .microphone | .system
}
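The commented ranges above (font size 16–96, words per line 2–12, background opacity 0.1–1.0) suggest clamping values before persisting them. A small sketch of that validation, with function names assumed for illustration:

```swift
import Foundation

// Clamps user-facing caption settings to their documented ranges.
func clampFontSize(_ value: CGFloat) -> CGFloat {
    min(max(value, 16), 96)
}

func clampWordsPerLine(_ value: Int) -> Int {
    min(max(value, 2), 12)
}

func clampBackgroundOpacity(_ value: CGFloat) -> CGFloat {
    min(max(value, 0.1), 1.0)
}
```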