khala

Architecture

System Overview

                          khala (manager)
                               |
                    +----------+----------+
                    |                     |
              khala-rvc              khala-core
           (Python/PyTorch)               |
                    |          +---------+---------+
                    |          |                   |
                 Unix Socket   Forward Pipeline    Reverse Pipeline
                    |          (ES -> EN)          (EN -> ES)
                    |          |                   |
                    |    Mic ----> GPT ----> RVC ----> BlackHole 2ch ---> Zoom
                    |                                                     |
                    |          BlackHole 16ch <--- Zoom <--- B speaks     |
                    |               |                                     |
                    +--------> GPT ----> Speaker                         |
                                         (you hear)

Khala runs two concurrent translation pipelines plus a TUI dashboard, managed by a single binary.

Pipelines

Forward Pipeline (you speak, they hear)

Your Spanish speech is translated to English with your voice cloned via RVC.

Bluetooth mic -> capture (cpal) -> resample to 24kHz -> base64 encode
  -> OpenAI GPT Realtime API (ES -> EN translation)
  -> RVC voice conversion (Unix socket) -> resample to device rate
  -> BlackHole 2ch (Zoom picks this up as mic)
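
The capture-side steps (resample to 24 kHz, then base64 encode PCM16) can be sketched as follows. This is a minimal illustration, not Khala's actual code: the function names are hypothetical, linear interpolation is an assumed resampling method (no anti-aliasing filter), and the base64 encoder is written by hand only to keep the sketch free of external crates.

```rust
/// Linearly resample mono f32 samples from `from_hz` to `to_hz`.
/// Assumption: plain linear interpolation; a real pipeline may use a
/// proper band-limited resampler instead.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    let mut out = Vec::with_capacity(out_len);
    for i in 0..out_len {
        let pos = i as f64 * step;
        let idx = pos as usize;
        let frac = (pos - idx as f64) as f32;
        let a = input[idx.min(last)];
        let b = input[(idx + 1).min(last)];
        out.push(a + (b - a) * frac);
    }
    out
}

/// Convert f32 samples in [-1.0, 1.0] to little-endian PCM16 bytes.
fn to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 2);
    for s in samples {
        let v = (s.clamp(-1.0, 1.0) * 32767.0) as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Standard base64 (RFC 4648 alphabet, with `=` padding).
fn base64_encode(data: &[u8]) -> String {
    const TABLE: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(TABLE[(n >> 18) as usize & 63] as char);
        out.push(TABLE[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { TABLE[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { TABLE[n as usize & 63] as char } else { '=' });
    }
    out
}
```

A chunk captured at 48 kHz would pass through `resample_linear(&chunk, 48_000, 24_000)`, then `to_pcm16_le`, then `base64_encode` before being sent to the API. The same steps apply to the reverse pipeline's capture side.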

Reverse Pipeline (they speak, you hear)

The other person’s English is translated to Spanish so you can understand them.

BlackHole 16ch (Zoom outputs B's audio here) -> capture (cpal)
  -> resample to 24kHz -> base64 encode
  -> OpenAI GPT Realtime API (EN -> ES translation)
  -> Bluetooth speaker (you hear)

Queue-Based Translation

Khala uses a queue model for natural, fluid translation:

  1. You speak — audio streams to the API input buffer continuously
  2. You pause — client-side VAD detects silence, commits the audio buffer
  3. Translation starts — if no response is in progress, a response.create is sent immediately
  4. You keep speaking — new sentences queue up while the current translation plays
  5. Translation finishes — the next queued sentence starts translating automatically
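
The five steps above amount to a small state machine. The sketch below is one possible shape for it, under the assumption that each committed utterance gets a numeric id; the type and method names are illustrative and not taken from Khala's source.

```rust
use std::collections::VecDeque;

/// What the pipeline should do after an event (illustrative names).
#[derive(Debug, PartialEq)]
enum Action {
    /// Send `response.create` for the utterance with this id.
    StartResponse(u64),
    /// Nothing to send right now.
    Idle,
}

/// Tracks committed utterances that are waiting to be translated.
struct TranslationQueue {
    pending: VecDeque<u64>,
    in_progress: Option<u64>,
    next_id: u64,
}

impl TranslationQueue {
    fn new() -> Self {
        TranslationQueue { pending: VecDeque::new(), in_progress: None, next_id: 0 }
    }

    /// Called when VAD detects a pause and the audio buffer is committed.
    fn on_commit(&mut self) -> Action {
        let id = self.next_id;
        self.next_id += 1;
        if self.in_progress.is_none() {
            // No response running: start translating immediately.
            self.in_progress = Some(id);
            Action::StartResponse(id)
        } else {
            // A translation is still playing: queue this sentence.
            self.pending.push_back(id);
            Action::Idle
        }
    }

    /// Called when the server reports `response.done`.
    fn on_response_done(&mut self) -> Action {
        self.in_progress = self.pending.pop_front();
        match self.in_progress {
            Some(id) => Action::StartResponse(id),
            None => Action::Idle,
        }
    }
}
```

With this shape, at most one response is in flight at a time, and sentences are translated in the order they were spoken.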

This means:

Conversation Cleanup

To prevent the model from drifting into conversational mode, Khala automatically deletes conversation items after each response:
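
One way to implement this cleanup is to remember every item id announced by `conversation.item.created` and emit a `conversation.item.delete` event for each once `response.done` arrives. The sketch below assumes exactly that bookkeeping; the `ItemCleaner` type is hypothetical, and the JSON is built with `format!` on the assumption that item ids contain no characters that need escaping.

```rust
/// Build a `conversation.item.delete` event for one item.
/// Assumption: `item_id` needs no JSON escaping.
fn delete_event(item_id: &str) -> String {
    format!(
        r#"{{"type":"conversation.item.delete","item_id":"{}"}}"#,
        item_id
    )
}

/// Remembers conversation items until the current response finishes.
struct ItemCleaner {
    seen: Vec<String>,
}

impl ItemCleaner {
    fn new() -> Self {
        ItemCleaner { seen: Vec::new() }
    }

    /// Record an id from a `conversation.item.created` event.
    fn on_item_created(&mut self, item_id: &str) {
        self.seen.push(item_id.to_string());
    }

    /// On `response.done`, return one delete event per recorded item
    /// and forget them all.
    fn on_response_done(&mut self) -> Vec<String> {
        self.seen.drain(..).map(|id| delete_event(&id)).collect()
    }
}
```

Each returned string would then be sent over the WebSocket as a text message.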

This keeps the conversation context minimal (only the system prompt).

Client-Side VAD

Khala uses its own Voice Activity Detection instead of the server’s turn detection:
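
The detection algorithm itself is not described here, so the following is only a sketch of a common approach: an RMS energy threshold with a "hangover" of quiet frames before an utterance is considered finished. The threshold, the frame size, and all names are assumptions, not Khala's actual parameters.

```rust
/// Result of feeding one audio frame to the detector.
#[derive(Debug, PartialEq)]
enum VadEvent {
    Silence,
    Speech,
    /// Enough quiet frames followed speech: commit the audio buffer.
    EndOfUtterance,
}

struct EnergyVad {
    threshold: f32,       // RMS level (0.0..1.0) that counts as speech
    hangover_frames: u32, // consecutive quiet frames that end an utterance
    quiet_run: u32,
    speaking: bool,
}

impl EnergyVad {
    fn new(threshold: f32, hangover_frames: u32) -> Self {
        EnergyVad { threshold, hangover_frames, quiet_run: 0, speaking: false }
    }

    fn push_frame(&mut self, frame: &[i16]) -> VadEvent {
        if rms(frame) >= self.threshold {
            self.speaking = true;
            self.quiet_run = 0;
            return VadEvent::Speech;
        }
        if self.speaking {
            self.quiet_run += 1;
            if self.quiet_run >= self.hangover_frames {
                self.speaking = false;
                self.quiet_run = 0;
                return VadEvent::EndOfUtterance;
            }
            // Short pause inside an utterance: still treated as speech.
            return VadEvent::Speech;
        }
        VadEvent::Silence
    }
}

/// Root-mean-square level of a PCM16 frame, normalized to 0.0..1.0.
fn rms(frame: &[i16]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum: f64 = frame
        .iter()
        .map(|&s| {
            let x = s as f64 / 32768.0;
            x * x
        })
        .sum();
    (sum / frame.len() as f64).sqrt() as f32
}
```

On `EndOfUtterance` the pipeline would send `input_audio_buffer.commit`; the hangover keeps short pauses between words from splitting a sentence.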

Startup Flow

  1. Run pre-flight checks (silent doctor)
  2. Start khala-rvc Python server as a child process (if enabled)
  3. Wait for the Unix socket to become ready (poll every 500 ms, 60 s timeout)
  4. Launch forward + reverse pipelines concurrently
  5. Start TUI dashboard (ratatui)
  6. On quit: kill the RVC server and clean up the socket
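
The socket wait in step 3 can be sketched with the standard library alone. This version treats the socket as ready once a connection attempt succeeds, which is an assumption (checking only that the socket file exists would be another option); the function name is hypothetical.

```rust
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll until a Unix socket at `path` accepts a connection, or give up
/// after `timeout`. Returns true if the socket became ready in time.
fn wait_for_socket(path: &Path, interval: Duration, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        // The probe connection is dropped immediately after it succeeds.
        if UnixStream::connect(path).is_ok() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

With the values from the list above, the call would be `wait_for_socket(path, Duration::from_millis(500), Duration::from_secs(60))`. Note that a connect-based probe is visible to the server as a short-lived client, so the server has to tolerate it.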

Project Structure

khala/
├── Cargo.toml                          # Workspace root
├── install.sh                          # Installation script
├── uninstall.sh                        # Uninstallation script
├── src/
│   ├── main.rs                         # CLI dispatch, RVC lifecycle, doctor
│   ├── cli.rs                          # Subcommand definitions (clap)
│   ├── config.rs                       # TOML config loading
│   └── ui.rs                           # TUI dashboard (ratatui)
├── khala-core/                         # Library crate
│   └── src/
│       ├── lib.rs                      # Public module exports
│       ├── audio.rs                    # Capture/playback (cpal)
│       ├── config.rs                   # Pipeline config struct
│       ├── metrics.rs                  # Lock-free pipeline metrics
│       ├── pipeline.rs                 # Pipeline orchestration
│       ├── protocol.rs                 # OpenAI Realtime API types
│       ├── rvc.rs                      # RVC Unix socket client
│       └── websocket.rs                # WebSocket send/receive
├── khala-config/
│   ├── config.toml                     # Default config template
│   └── prompt.txt                      # Default translation prompt
└── khala-rvc/                          # Python RVC server
    ├── main.py                         # Entry point + CLI args
    ├── processor.py                    # RvcProcessor (voice conversion)
    ├── server.py                       # Asyncio Unix socket server
    ├── macos_compat.py                 # Apple Silicon workarounds
    └── requirements.txt                # Python dependencies

Communication Protocol

Rust <-> OpenAI (WebSocket)

Uses the OpenAI Realtime API v1 (beta) over WebSocket:

Direction          Event                        Purpose
Client -> Server   session.update               Configure session (modalities, voice, temperature, noise reduction)
Client -> Server   input_audio_buffer.append    Stream audio chunks (base64 PCM16)
Client -> Server   input_audio_buffer.commit    Commit buffered audio for processing
Client -> Server   response.create              Trigger a translation response
Client -> Server   conversation.item.delete     Clean up processed items
Server -> Client   response.audio.delta         Translated audio chunks (base64 PCM16)
Server -> Client   response.text.delta          Translated text chunks
Server -> Client   response.done                Response completed
Server -> Client   conversation.item.created    New item in conversation
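
As a rough illustration of how these events map to code, the sketch below serializes three of the client events and classifies incoming server events by their `type` field. The enum and function names are hypothetical, and the JSON is assembled with `format!` only to avoid external crates; a real implementation would more likely use a JSON library.

```rust
/// A subset of the client -> server events listed above.
enum ClientEvent<'a> {
    /// `audio_base64` is base64-encoded PCM16 audio.
    AppendAudio { audio_base64: &'a str },
    CommitAudio,
    CreateResponse,
}

impl<'a> ClientEvent<'a> {
    /// Serialize to the JSON text sent over the WebSocket.
    /// Assumption: the base64 payload needs no JSON escaping.
    fn to_json(&self) -> String {
        match self {
            ClientEvent::AppendAudio { audio_base64 } => format!(
                r#"{{"type":"input_audio_buffer.append","audio":"{}"}}"#,
                audio_base64
            ),
            ClientEvent::CommitAudio => {
                r#"{"type":"input_audio_buffer.commit"}"#.to_string()
            }
            ClientEvent::CreateResponse => r#"{"type":"response.create"}"#.to_string(),
        }
    }
}

/// The server -> client events listed above, keyed by their `type` value.
#[derive(Debug, PartialEq)]
enum ServerEvent {
    AudioDelta,
    TextDelta,
    ResponseDone,
    ItemCreated,
    Other,
}

fn classify(event_type: &str) -> ServerEvent {
    match event_type {
        "response.audio.delta" => ServerEvent::AudioDelta,
        "response.text.delta" => ServerEvent::TextDelta,
        "response.done" => ServerEvent::ResponseDone,
        "conversation.item.created" => ServerEvent::ItemCreated,
        _ => ServerEvent::Other,
    }
}
```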

Rust <-> Python RVC (Unix Socket)

Length-prefixed binary protocol over Unix socket:

[4 bytes: payload length (u32 LE)] [payload: PCM16 i16 LE samples]
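
A frame in this format can be written and read as sketched below. The function names are hypothetical, and the sketch assumes the length field counts payload bytes (not samples), which is the usual reading of "payload length" but is not confirmed by the layout above.

```rust
use std::io::{self, Read, Write};

/// Write one frame: u32 LE byte length, then the samples as i16 LE.
fn write_frame<W: Write>(w: &mut W, samples: &[i16]) -> io::Result<()> {
    let len = (samples.len() * 2) as u32; // assumed: length in bytes
    w.write_all(&len.to_le_bytes())?;
    for s in samples {
        w.write_all(&s.to_le_bytes())?;
    }
    Ok(())
}

/// Read one frame and decode its payload into i16 samples.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<i16>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    if len % 2 != 0 {
        // PCM16 samples are two bytes each, so an odd length is malformed.
        return Err(io::Error::new(io::ErrorKind::InvalidData, "odd payload length"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload
        .chunks_exact(2)
        .map(|b| i16::from_le_bytes([b[0], b[1]]))
        .collect())
}
```

Both functions are generic over `Read`/`Write`, so the same code works with a `std::os::unix::net::UnixStream` in the pipeline and with an in-memory buffer in tests.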