Announcing OSTT
OSTT is an open-source speech-to-text recording tool with real-time audio visualization, transcription history, and multi-provider AI transcription support.
I built OSTT, an open-source speech-to-text recording tool for Linux and macOS.
The short version: press a hotkey, get a small terminal UI popup, speak, see that audio is actually being recorded, then send the result to a transcription provider and paste the text wherever you need it.
The longer version starts with a small frustration.
When I switched from macOS to Linux and Omarchy, one of the tools I missed immediately was a visual speech-to-text recorder. On macOS, I used a small hotkey-driven app that opened a window, showed a live audio waveform, and made it obvious that recording was active.
That visual feedback mattered more than I expected.
On Linux, I found alternatives, but most of them ran entirely in the background. Some were unstable. A few times I spoke for several minutes, only to discover that nothing had been recorded. That is exactly the kind of failure mode a dictation tool should avoid.
So I built the thing I wanted: a speech-to-text tool that makes recording visible.
A visual CLI recorder
OSTT stands for Open Speech-to-Text. It is a CLI application built in Rust with Ratatui, but the experience is intentionally visual.
In my setup, a global hotkey opens OSTT as a floating popup window in Hyprland. The UI starts recording immediately and shows a live visualization of the incoming audio. I can pause with the space bar when I need to think, resume when I am ready, and press enter to transcribe.
That combination is the whole point. It is still a terminal application. It still works as a normal command-line tool. But it gives me the confidence of a small focused UI that says: yes, this is recording, and yes, your microphone level looks right.
That is especially useful with external microphones. I wanted to see whether I was recording at a reasonable level, whether the mic was active, and whether I was clipping. Background-only tools hide all of that until it is too late.
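The level check behind that kind of feedback can be very small. As an illustrative sketch (not OSTT's actual code; the function names and the assumption of f32 samples in [-1.0, 1.0] are mine), a meter only needs an RMS level and a clipping test per buffer:

```rust
// Illustrative level metering for a visual recorder.
// Assumes f32 samples normalized to [-1.0, 1.0].

/// Root-mean-square level of a buffer of samples.
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// True if any sample touches full scale, i.e. the input is clipping.
fn is_clipping(samples: &[f32]) -> bool {
    samples.iter().any(|s| s.abs() >= 1.0)
}

fn main() {
    let quiet = [0.01f32, -0.02, 0.015, -0.01];
    let hot = [0.8f32, -1.0, 0.9, -0.95];
    println!("quiet: rms = {:.3}, clipping = {}", rms(&quiet), is_clipping(&quiet));
    println!("hot:   rms = {:.3}, clipping = {}", rms(&hot), is_clipping(&hot));
}
```

Feeding values like these into a terminal gauge each frame is enough to answer the questions above: is the mic active, is the level reasonable, and is it clipping.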
What OSTT does
OSTT is deliberately simple at the core:
- Record audio.
- Show real-time audio feedback.
- Send the recording to a transcription provider.
- Return the text to stdout, the clipboard, or a file.
It also includes the things I kept wanting in daily use:
- A browsable transcription history, so a failed paste does not mean the text is gone.
- A keyword list for names, technical terms, and other words transcription models often miss.
- Provider and model selection through ostt auth.
- Audio device selection through ostt list-devices and ostt config.
- Support for popup workflows on Hyprland and Omarchy, while still working as a regular CLI on Linux and macOS.
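To make the popup workflow concrete, a Hyprland setup along these lines works: bind a key to launch ostt in a terminal with a distinctive window class, then float and center that class. This is an illustrative sketch, not OSTT's documented configuration; the terminal, class name, key, and sizes are all assumptions, and the repository has the real setup instructions.

```
# Hedged example: launch ostt in a floating terminal popup on a hotkey.
bind = SUPER, D, exec, alacritty --class ostt-popup -e ostt

# Float and center any window with that class.
windowrulev2 = float, class:^(ostt-popup)$
windowrulev2 = size 600 200, class:^(ostt-popup)$
windowrulev2 = center, class:^(ostt-popup)$
```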
The repository has the full installation and setup details, including provider support and platform-specific popup configuration.
Built with agents
One part of this project still feels slightly strange to say: OSTT is almost entirely vibe coded.
I did not sit down and hand-write the application line by line. I built it with OpenCode and AI coding agents, mostly using fast and inexpensive models, bringing in stronger models only for the more difficult issues. The result is a useful Rust application that cost very little to produce and that I now use constantly.
That is the part I find most interesting.
Agentic coding has crossed a threshold where building small, personal, high-quality tools has become dramatically cheaper. Not just prototypes. Real tools. Tools that fit your workflow closely enough that they do not need to make sense as a startup, a SaaS product, or a commercial opportunity.
OSTT exists because I wanted a very specific interaction: a hotkey-activated, visual, terminal-based dictation tool that worked well in my Linux setup. A few months ago, that might not have been worth the effort. Now it was.
That changes what is reasonable to build.
Open source
OSTT is MIT licensed and open source:
https://github.com/kristoferlund/ostt
Suggestions, issues, and contributions are welcome. I would be happy to see more providers, better platform integrations, and other improvements from people who want a visual speech-to-text workflow that stays close to the terminal.