Editor

Captions

Generate captions from your recording's audio using on-device speech-to-text. Style them in the editor, preview in real time, then burn them into the export or save as SRT/VTT.

Speech-to-text

On-device transcription

WhisperKit runs Whisper models locally through CoreML. Models are downloaded once and cached. No data leaves your machine.

WhisperKit

Runs OpenAI's Whisper models on-device through Apple's CoreML runtime. No network calls, no API keys. Your audio stays on your machine.

Four Model Sizes

Base (~140 MB), Small (~460 MB), Medium (~1.5 GB), and Large v3 (~3 GB). Downloaded on first use and cached at ~/.reframed/models/.

Apple Silicon Only

Runs on the Neural Engine, GPU, and CPU together. Requires Apple Silicon — the captions tab is hidden on Intel Macs.

Speed vs. Quality

Base transcribes at roughly 30x real time, Large at about 5x. For a 10-minute recording, that's 20 seconds vs. 2 minutes.

Available models

Model       Size      Speed        Quality
Base        ~140 MB   ~30x real    Good for clear speech
Small       ~460 MB   ~15x real    Better with accents
Medium      ~1.5 GB   ~8x real     Noticeably more accurate
Large v3    ~3 GB     ~5x real     Best available

Decoding options

wordTimestamps:             true
chunkingStrategy:           .vad
temperatureFallbackCount:   2
compressionRatioThreshold:  2.8
noSpeechThreshold:          0.5
computeUnits:               .all (CPU + GPU + ANE)
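
These options map closely onto WhisperKit's `DecodingOptions` type. A minimal sketch, assuming current WhisperKit parameter names; compute units are configured separately when the model is loaded, not per decode:

```swift
import WhisperKit

// Sketch: the decoding options above expressed as a WhisperKit
// DecodingOptions value. Parameter names assume WhisperKit's public API.
let options = DecodingOptions(
    temperatureFallbackCount: 2,    // retries at higher temperature on failure
    wordTimestamps: true,           // per-word start/end times
    compressionRatioThreshold: 2.8, // reject degenerate, repetitive output
    noSpeechThreshold: 0.5,         // skip windows with no detected speech
    chunkingStrategy: .vad          // split audio at silence boundaries
)
```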

Transcription

How transcription works

Pick an audio source and a language, then hit generate. Progress updates live as WhisperKit processes the audio in 30-second windows, split at silence boundaries by voice-activity detection (VAD).

Audio Source Selection

Transcribe from the microphone track or system audio. If only one track exists, it's selected automatically.

Language Detection

Auto-detect or pick from 99 languages. Setting the language explicitly skips detection and speeds things up a little.

Word-level Timestamps

Every word gets its own start and end time. This is what makes per-word highlighting and line-breaking work.

Short Segment Merging

Fragments with fewer than 4 words get folded into the previous segment when the gap is under 2 seconds. Cuts down on subtitle flicker.
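
The merge pass can be sketched as follows (illustrative names, not the app's actual implementation):

```swift
import Foundation

struct Segment {
    var startSeconds: Double
    var endSeconds: Double
    var text: String
}

// Fold segments with fewer than minWords words into the previous segment
// when the silence gap between them is under maxGap seconds.
func mergeShortSegments(_ segments: [Segment],
                        minWords: Int = 4,
                        maxGap: Double = 2.0) -> [Segment] {
    var merged: [Segment] = []
    for segment in segments {
        let wordCount = segment.text.split(separator: " ").count
        if var previous = merged.last,
           wordCount < minWords,
           segment.startSeconds - previous.endSeconds < maxGap {
            // Extend the previous segment instead of emitting a fragment.
            previous.endSeconds = segment.endSeconds
            previous.text += " " + segment.text
            merged[merged.count - 1] = previous
        } else {
            merged.append(segment)
        }
    }
    return merged
}
```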

Styling

Caption appearance

All styling is previewed live in the editor. Font size scales proportionally so it matches between the small preview and full-resolution export.

Font Size & Weight

Size from 16 to 96 px, weight from regular to bold. Scaling is relative to a 1920 px reference width, so it looks the same in preview and export.
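
The scaling is a simple linear formula; a sketch of the assumed arithmetic, using the 1920 px reference width described above:

```swift
import CoreGraphics

// Treat the chosen size as "points at 1920 px wide" and scale it
// linearly to whatever width is being rendered (preview or export).
func scaledFontSize(_ fontSize: CGFloat, renderWidth: CGFloat) -> CGFloat {
    let referenceWidth: CGFloat = 1920
    return fontSize * renderWidth / referenceWidth
}
```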

Colors

Text color and background color picked from a Tailwind-style palette. Background opacity is a separate slider so you can soften it without changing the color.

Position

Bottom, top, or center. Positioned relative to the video rect, not the canvas, so padding and aspect ratio changes don't push captions off-screen.
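
A sketch of the positioning math, with illustrative names and an assumed margin fraction not stated above, using CoreGraphics' bottom-left origin:

```swift
import CoreGraphics

enum CaptionPosition { case bottom, top, center }

// Compute the caption's vertical origin inside the video rect (not the
// full canvas), so padding and aspect-ratio changes can't push captions
// off-screen. The margin is a fraction of the video height.
func captionOriginY(for position: CaptionPosition,
                    videoRect: CGRect,
                    captionHeight: CGFloat,
                    marginFraction: CGFloat = 0.05) -> CGFloat {
    let margin = videoRect.height * marginFraction
    switch position {
    case .bottom: return videoRect.minY + margin
    case .top:    return videoRect.maxY - margin - captionHeight
    case .center: return videoRect.midY - captionHeight / 2
    }
}
```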

Words per Line

Controls how many words fit on one line before wrapping. Long segments get split into two-line windows that advance as the speaker progresses.
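
The wrapping described above can be sketched as follows (illustrative, not the app's actual code):

```swift
// Group a segment's words into lines of at most maxWordsPerLine, then
// pair consecutive lines into two-line windows that are displayed one
// after another as playback advances.
func twoLineWindows(words: [String], maxWordsPerLine: Int) -> [[String]] {
    // Break into lines of up to maxWordsPerLine words each.
    var lines: [String] = []
    var index = 0
    while index < words.count {
        let end = min(index + maxWordsPerLine, words.count)
        lines.append(words[index..<end].joined(separator: " "))
        index = end
    }
    // Pair consecutive lines into windows of at most two lines.
    var windows: [[String]] = []
    var lineIndex = 0
    while lineIndex < lines.count {
        let end = min(lineIndex + 2, lines.count)
        windows.append(Array(lines[lineIndex..<end]))
        lineIndex = end
    }
    return windows
}
```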

Export

Getting captions out

Three options: burn into the video, export as SRT, or export as VTT. Pick one per export.

Burn-in

Captions are rendered into the video frames with CoreText during export. What you see in the preview is what you get in the file.

SRT Export

Standard SubRip format. Saved next to the video with the same filename. Works with YouTube, Vimeo, and most players.

WebVTT Export

VTT format for web players and browsers. Same timing data, different container.

Burn-in rendering pipeline

CameraVideoCompositor.drawCaptions():
  1. Find active segment for current frame time
  2. Build visible text (word grouping + line wrapping)
  3. Measure text with CTFramesetter
  4. Draw background pill (rounded rect, user opacity)
  5. Draw text with CTFrameDraw
  6. Position relative to videoRect based on CaptionPosition

SRT output format

1
00:00:01,200 --> 00:00:04,800
First line of the transcription

2
00:00:05,100 --> 00:00:08,300
Second line of the transcription
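
Generating that output is mostly timestamp formatting; a minimal sketch with illustrative function names. SRT uses a comma, not a period, before the milliseconds field:

```swift
import Foundation

// Format seconds as an SRT timestamp: HH:MM:SS,mmm.
func srtTimestamp(_ seconds: Double) -> String {
    let totalMillis = Int((seconds * 1000).rounded())
    let h = totalMillis / 3_600_000
    let m = (totalMillis / 60_000) % 60
    let s = (totalMillis / 1000) % 60
    let ms = totalMillis % 1000
    return String(format: "%02d:%02d:%02d,%03d", h, m, s, ms)
}

// One numbered SRT block: index, timing line, caption text.
func srtBlock(index: Int, start: Double, end: Double, text: String) -> String {
    "\(index)\n\(srtTimestamp(start)) --> \(srtTimestamp(end))\n\(text)\n"
}
```
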

Data

Segment structure

Segments and their word-level timing are saved in the project file alongside all other editor state. Re-opening a project restores captions without re-transcribing.

CaptionSegment

CaptionSegment {
  id:            UUID
  startSeconds:  Double
  endSeconds:    Double
  text:          String
  words:         [CaptionWord]?
}
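
That schema maps directly onto Codable Swift types, which is what makes the round trip through the project file work. A sketch, assuming CaptionWord carries the same per-word timing fields (its exact shape isn't shown above):

```swift
import Foundation

// Assumed shape: one transcribed word with its own timing.
struct CaptionWord: Codable, Equatable {
    let word: String
    let startSeconds: Double
    let endSeconds: Double
}

// Mirrors the CaptionSegment schema above; Codable conformance lets it
// serialize into the project file alongside other editor state.
struct CaptionSegment: Codable, Identifiable {
    let id: UUID
    var startSeconds: Double
    var endSeconds: Double
    var text: String
    var words: [CaptionWord]?   // nil when word timing is unavailable
}
```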

CaptionSettingsData

CaptionSettingsData {
  enabled:            Bool
  fontSize:           CGFloat        // 16–96
  fontWeight:         CaptionFontWeight
  textColor:          CodableColor
  backgroundColor:    CodableColor
  backgroundOpacity:  CGFloat        // 0.1–1.0
  showBackground:     Bool
  position:           .bottom | .top | .center
  maxWordsPerLine:    Int            // 2–12
  model:              String         // e.g. "openai_whisper-base"
  language:           CaptionLanguage
  audioSource:        .microphone | .system
}