Stay in VS Code and transform your transcripts into professional demo videos. Speechify 3.2.0 combines Azure Speech, local CosyVoice and Qwen3-TTS voice cloning, and guided configuration to automate alignment, edit timing, and generate synced audio-visual content in one workflow.
Experience 200+ natural voices powered by Azure
The complete toolkit for AI-driven demo video production and high-quality synthesis
AI automatically analyzes your screen recordings to identify key moments and perfectly syncs your voiceover with visual changes.
Fine-tune your audio-visual sync with a built-in drag-and-drop editor. Achieve millisecond precision without leaving VS Code.
Automatically combine processed video, AI speech, and precisely timed subtitles (SRT) into a ready-to-share MP4 file.
AI helps adjust your script's length and pacing based on the actual video clip duration for a natural-sounding narration.
Choose between Azure neural voices, a fully local CosyVoice backend, or local Qwen3-TTS + MLX-Audio reference-audio voice cloning, all directly inside VS Code.
Optimized for multilingual creators with built-in English/Chinese UI and intelligent language detection for scripts.
How we compare to traditional video editors
Stop fighting with timelines and keyframes. Your video is defined by your JSON script. Update the text, and the entire video updates. Perfect for agile technical content.
Instead of manual drag-and-drop, our Vision AI "watches" your recording to automatically align your narrative with visual changes. It's automated precision.
Stay in the tool you love. From writing documentation to producing the final demo video, everything happens within VS Code. No heavy video editors required.
Seamlessly integrate with GitHub Copilot to translate scripts. Generate localized versions of the same video content for global markets in minutes.
Speechify 3.2.0 scales from simple audio conversion to complex video production
Record your app, write a script, and let AI handle the heavy lifting of syncing your explanation with UI interactions.
Generate high-quality video tutorials with precise subtitles and professional voiceovers directly from your documentation.
Fast-track your social media presence by generating voiced-over clips and product demos with diverse AI personas.
Convert complex technical documentation into audio-visual guides that make learning more inclusive and engaging.
Explain your architecture or new features through voiced demos that highlight specific code blocks as you speak.
Build transcript-based video production pipelines that allow you to iterate on your content as fast as you write your documentation.
Choose your speech backend and get started quickly inside VS Code
Search for "Speechify" in VS Code Extensions (Ctrl+Shift+X) or click the install button above.
Open Speechify settings in VS Code and choose Azure Speech, CosyVoice (Local), or Qwen3-TTS + MLX-Audio (Local) as your generation backend.
Select text and use the context menu or Ctrl+Shift+P to run Azure: Generate Voiceover, Local CosyVoice: Generate Voiceover, or Local Qwen3-TTS: Generate Voiceover.
Voice cloning, reference recording, and local transcription without leaving VS Code
CosyVoice runs through a local FastAPI backend at speechify.cosyVoice.baseUrl (default: http://127.0.0.1:50000).
Qwen3-TTS runs directly through mlx-audio and a local Python environment configured by speechify.qwenTts.pythonPath, so no long-running server is required.
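As a rough illustration of how the two settings above might be read, here is a minimal TypeScript sketch. Only the setting keys speechify.cosyVoice.baseUrl and speechify.qwenTts.pythonPath and the default address come from this page; the type and function names are hypothetical and are not Speechify's actual code.

```typescript
// Hypothetical settings shape; only the two keys below are named on this page.
type ProviderSettings = {
  "speechify.cosyVoice.baseUrl"?: string;
  "speechify.qwenTts.pythonPath"?: string;
};

// Default CosyVoice address as documented above.
const DEFAULT_COSYVOICE_BASE_URL = "http://127.0.0.1:50000";

// Returns the CosyVoice base URL, falling back to the documented default
// when the setting is missing or blank. Trailing slashes are trimmed.
function resolveCosyVoiceBaseUrl(settings: ProviderSettings): string {
  const raw = (settings["speechify.cosyVoice.baseUrl"] ?? "").trim();
  const value = raw.length > 0 ? raw : DEFAULT_COSYVOICE_BASE_URL;
  return value.replace(/\/+$/, "");
}

// Returns the configured Python path for Qwen3-TTS, or undefined when unset.
// This sketch does not assume any default interpreter.
function resolveQwenPythonPath(settings: ProviderSettings): string | undefined {
  const raw = (settings["speechify.qwenTts.pythonPath"] ?? "").trim();
  return raw.length > 0 ? raw : undefined;
}
```

With an empty settings object, the first helper yields the documented default address and the second yields undefined, which a caller could treat as "Qwen3-TTS is not configured yet".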
Use Local CosyVoice: Set Reference Voice or Local Qwen3-TTS: Set Reference Voice to record a sample in VS Code or select an existing audio or video file as reference media.
When you choose a video, Speechify extracts the audio automatically and can open workspace settings so you can review and edit the reference transcript.
Right-click your text and choose Local CosyVoice: Generate Voiceover or Local Qwen3-TTS: Generate Voiceover. Speechify uses the configured reference audio for cloning and saves the generated result into your project workflow.
Reference text is optional, but recommended. CosyVoice and Qwen3-TTS both produce better cloning quality when the reference audio also has a matching transcript.
Reference-media transcription runs locally. On macOS, Speechify tries Whisper MLX first for faster Apple Silicon inference, then falls back to a local Python Whisper runtime when MLX is unavailable.
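The "try Whisper MLX first, then fall back" behavior described above follows a common ordered-fallback pattern. The sketch below shows that pattern in TypeScript under stated assumptions: the Transcriber type, the transcribeWithFallback function, and the runtime labels are hypothetical, and the runtimes are plain callbacks rather than real Whisper bindings.

```typescript
// A transcription runtime: resolves to text, or rejects when it is unavailable.
// Hypothetical type for illustration only.
type Transcriber = {
  name: string;
  run: (audioPath: string) => Promise<string>;
};

// Tries each runtime in order and returns the first successful transcript,
// along with the name of the runtime that produced it.
async function transcribeWithFallback(
  audioPath: string,
  runtimes: Transcriber[],
): Promise<{ runtime: string; text: string }> {
  const errors: string[] = [];
  for (const runtime of runtimes) {
    try {
      const text = await runtime.run(audioPath);
      return { runtime: runtime.name, text };
    } catch (err) {
      // Remember why this runtime failed, then move on to the next one.
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${runtime.name}: ${reason}`);
    }
  }
  throw new Error(`No transcription runtime succeeded (${errors.join("; ")})`);
}
```

Passing the runtimes as [whisperMlx, pythonWhisper] would reproduce the order described above: the second runtime is only called when the first one rejects.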
In your workspace settings, make sure you configure the local backend address or Python path, the reference audio path, and the reference transcript for the provider you actually use.
The full list of CosyVoice and Qwen3-TTS configuration keys, what each one does, and complete settings.json examples are documented in the project README.
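The checklist above (backend address or Python path, reference audio, reference transcript) can be pictured as a small validation step. The TypeScript sketch below is illustrative only: the LocalVoiceConfig fields are hypothetical names, not real Speechify setting keys, and it assumes a missing Qwen3-TTS Python path is an error, which this page does not confirm. The real keys are listed in the project README.

```typescript
type Provider = "cosyvoice" | "qwen3-tts";

// Hypothetical, simplified view of the values the checklist mentions.
interface LocalVoiceConfig {
  provider: Provider;
  baseUrl?: string;            // CosyVoice only
  pythonPath?: string;         // Qwen3-TTS only
  referenceAudioPath?: string; // used for voice cloning
  referenceText?: string;      // optional, but recommended
}

// Returns blocking problems (errors) and non-blocking hints (warnings)
// for the provider that is actually selected.
function checkLocalVoiceConfig(cfg: LocalVoiceConfig): { errors: string[]; warnings: string[] } {
  const errors: string[] = [];
  const warnings: string[] = [];
  const isBlank = (v?: string): boolean => v === undefined || v.trim() === "";

  if (cfg.provider === "qwen3-tts" && isBlank(cfg.pythonPath)) {
    // Assumption: treated as an error here because no default is documented.
    errors.push("Set the Python path for Qwen3-TTS.");
  }
  if (cfg.provider === "cosyvoice" && isBlank(cfg.baseUrl)) {
    warnings.push("No CosyVoice address set; the documented default http://127.0.0.1:50000 would apply.");
  }
  if (isBlank(cfg.referenceAudioPath)) {
    errors.push("Set a reference audio path.");
  }
  if (isBlank(cfg.referenceText)) {
    warnings.push("Add a reference transcript for better cloning quality.");
  }
  return { errors, warnings };
}
```

Only the fields for the selected provider are checked, which mirrors the advice to configure "the provider you actually use".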
Local engine repositories:
Get help and contribute to the Speechify project
Found a bug or have a feature request? Create an issue on our GitHub repository and we'll help you resolve it quickly.
Open Issues
Comprehensive guides for Azure Speech, local CosyVoice and Qwen3-TTS setup, reference-media configuration, and development documentation in multiple languages.
Read Docs
Help improve Speechify by contributing code, translations, documentation, or reporting bugs and suggesting new features.
Contribute