Kristofer Lund

Announcing OSTT

OSTT is an open-source speech-to-text recording tool with real-time audio visualization, transcription history, and multi-provider AI transcription support.

I built OSTT, an open-source speech-to-text recording tool for Linux and macOS.

The short version: press a hotkey, get a small terminal UI popup, speak, see that audio is actually being recorded, then send the result to a transcription provider and paste the text wherever you need it.

The longer version starts with a small frustration.

When I switched from macOS to Linux and Omarchy, one of the tools I missed immediately was a visual speech-to-text recorder. On macOS, I used a small hotkey-driven app that opened a window, showed a live audio waveform, and made it obvious that recording was active.

That visual feedback mattered more than I expected.

On Linux, I found alternatives, but most of them ran entirely in the background. Some were unstable. A few times I spoke for several minutes, only to discover that nothing had been recorded. That is exactly the kind of failure mode a dictation tool should avoid.

So I built the thing I wanted: a speech-to-text tool that makes recording visible.

A visual CLI recorder

OSTT stands for Open Speech-to-Text. It is a CLI application built in Rust with Ratatui, but the experience is intentionally visual.

In my setup, a global hotkey opens OSTT as a floating popup window in Hyprland. The UI starts recording immediately and shows a live visualization of the incoming audio. I can pause with the space bar when I need to think, resume when I am ready, and press enter to transcribe.
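A setup like that can be sketched in Hyprland config. Everything here is an assumption on my part: the key binding, the terminal emulator, and the window class are placeholders, so check the OSTT repository for the exact values your setup needs.

```ini
# Assumed sketch, not the canonical OSTT config: bind a hotkey that
# launches OSTT in a terminal with a dedicated window class...
bind = SUPER, D, exec, alacritty --class ostt-popup -e ostt

# ...then float, center, and size that class so it appears as a popup
# over the current workspace instead of tiling.
windowrulev2 = float, class:^(ostt-popup)$
windowrulev2 = center, class:^(ostt-popup)$
windowrulev2 = size 600 200, class:^(ostt-popup)$
```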

That combination is the whole point. It is still a terminal application. It still works as a normal command-line tool. But it gives me the confidence of a small, focused UI that says: yes, this is recording, and yes, your microphone level looks right.

That is especially useful with external microphones. I wanted to see whether I was recording at a reasonable level, whether the mic was active, and whether I was clipping. Background-only tools hide all of that until it is too late.
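The level check itself is simple to reason about. This is not OSTT's actual code, just a minimal Rust sketch of the kind of reading a visual recorder can surface, assuming f32 samples in the -1.0..=1.0 range (the convention used by common Rust audio crates):

```rust
/// Summary of one audio buffer: RMS level, peak, and a clipping flag.
struct LevelReading {
    rms: f32,
    peak: f32,
    clipping: bool,
}

/// Compute RMS and peak over a buffer of samples in -1.0..=1.0.
fn measure(samples: &[f32]) -> LevelReading {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    let rms = (samples.iter().map(|s| s * s).sum::<f32>()
        / samples.len().max(1) as f32)
        .sqrt();
    // Treat anything at (or essentially at) full scale as clipping.
    LevelReading { rms, peak, clipping: peak >= 0.99 }
}

/// Map a 0.0..=1.0 level onto a terminal meter of `width` cells.
fn bar_width(level: f32, width: usize) -> usize {
    (level.clamp(0.0, 1.0) * width as f32).round() as usize
}

fn main() {
    // A quiet buffer and a clipped buffer as stand-in input.
    let quiet = vec![0.05f32; 512];
    let loud = vec![1.0f32; 512];
    for (name, buf) in [("quiet", &quiet), ("loud", &loud)] {
        let r = measure(buf);
        println!(
            "{name}: rms={:.2} peak={:.2} bar={} clipping={}",
            r.rms,
            r.peak,
            bar_width(r.rms, 40),
            r.clipping
        );
    }
}
```

A meter like this answers exactly the questions above at a glance: is the mic live, is the level reasonable, and am I clipping.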

What OSTT does

OSTT is deliberately simple at the core:

  1. Record audio.
  2. Show real-time audio feedback.
  3. Send the recording to a transcription provider.
  4. Return the text to stdout, the clipboard, or a file.
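The provider step in particular lends itself to a small abstraction. The sketch below is hypothetical, not OSTT's actual internals: the trait and type names are mine, and the stand-in provider exists only so the example runs without a network call. The idea is that each transcription backend implements one trait, so the rest of the pipeline stays provider-agnostic.

```rust
/// A recorded clip: raw PCM samples plus the sample rate.
struct Recording {
    samples: Vec<f32>,
    sample_rate: u32,
}

/// Anything that can turn audio into text.
trait TranscriptionProvider {
    fn name(&self) -> &str;
    fn transcribe(&self, rec: &Recording) -> Result<String, String>;
}

/// A stand-in provider used here so the sketch is runnable;
/// a real backend would call out to a transcription API instead.
struct EchoProvider;

impl TranscriptionProvider for EchoProvider {
    fn name(&self) -> &str {
        "echo"
    }
    fn transcribe(&self, rec: &Recording) -> Result<String, String> {
        Ok(format!("{} samples at {} Hz", rec.samples.len(), rec.sample_rate))
    }
}

/// Steps 3 and 4 above: hand the recording to whichever provider
/// is configured and return the resulting text.
fn run(provider: &dyn TranscriptionProvider, rec: &Recording) -> Result<String, String> {
    provider.transcribe(rec)
}

fn main() {
    let rec = Recording { samples: vec![0.0; 16_000], sample_rate: 16_000 };
    let text = run(&EchoProvider, &rec).expect("transcription failed");
    println!("[{}] {}", EchoProvider.name(), text);
}
```

With a boundary like this, adding a provider means implementing one trait rather than touching the recording or output code.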

It also includes the things I kept wanting in daily use:

  • A browsable transcription history, so a failed paste does not mean the text is gone.
  • A keyword list for names, technical terms, and other words transcription models often miss.
  • Provider and model selection through ostt auth.
  • Audio device selection through ostt list-devices and ostt config.
  • Support for popup workflows on Hyprland and Omarchy, while still working as a regular CLI on Linux and macOS.

The repository has the full installation and setup details, including provider support and platform-specific popup configuration.

Built with agents

One part of this project still feels slightly strange to say: OSTT is almost entirely vibe-coded.

I did not sit down and hand-write the application line by line. I built it with OpenCode and AI coding agents, mostly using fast and inexpensive models, bringing in stronger models only for the more difficult issues. The result is a useful Rust application that cost very little to produce and that I now use constantly.

That is the part I find most interesting.

Agentic coding has crossed a threshold where building small, personal, high-quality tools has become dramatically cheaper. Not just prototypes. Real tools. Tools that fit your workflow closely enough that they do not need to make sense as a startup, a SaaS product, or a commercial opportunity.

OSTT exists because I wanted a very specific interaction: a hotkey-activated, visual, terminal-based dictation tool that worked well in my Linux setup. A few months ago, that might not have been worth the effort. Now it was.

That changes what is reasonable to build.

Open source

OSTT is MIT licensed and open source:

https://github.com/kristoferlund/ostt

Suggestions, issues, and contributions are welcome. I would be happy to see more providers, better platform integrations, and other improvements from people who want a visual speech-to-text workflow that stays close to the terminal.