```
                  khala (manager)
                        |
             +----------+-----------+
             |                      |
         khala-rvc              khala-core
      (Python/PyTorch)              |
             |            +---------+----------+
             |            |                    |
        Unix Socket Forward Pipeline   Reverse Pipeline
             |       (ES -> EN)           (EN -> ES)
             |            |                    |
             |   Mic --> GPT --> RVC --> BlackHole 2ch --> Zoom
             |                    |
             +--------------------+

      BlackHole 16ch <-- Zoom <-- B speaks
             |
             +--> GPT --> Speaker   (you hear)
```
Khala runs two concurrent translation pipelines plus a TUI dashboard, managed by a single binary.
Your Spanish speech is translated to English with your voice cloned via RVC.
```
Bluetooth mic -> capture (cpal) -> resample to 24kHz -> base64 encode
  -> OpenAI GPT Realtime API (ES -> EN translation)
  -> RVC voice conversion (Unix socket) -> resample to device rate
  -> BlackHole 2ch (Zoom picks this up as mic)
```
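The capture-side framing (resample to 24 kHz, PCM16, base64) can be sketched as follows. Khala does this in Rust; this Python version, including the naive linear-interpolation resampler and the function name, is illustrative only:

```python
import base64
import struct

def to_pcm16_base64(samples, src_rate=48000, dst_rate=24000):
    """Resample f32 samples to dst_rate (linear interpolation),
    convert to PCM16 little-endian, and base64-encode, matching
    the input_audio_buffer.append payload format."""
    ratio = src_rate / dst_rate
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        s = samples[lo] * (1.0 - frac) + samples[hi] * frac
        # Clamp to [-1, 1] and scale to the i16 range
        s = max(-1.0, min(1.0, s))
        out.append(int(s * 32767))
    pcm = struct.pack("<%dh" % len(out), *out)
    return base64.b64encode(pcm).decode("ascii")

# 480 samples at 48 kHz become 240 samples at 24 kHz
chunk = to_pcm16_base64([0.0, 0.5, -0.5, 0.25] * 120)
```

A production resampler would use a proper low-pass filter; linear interpolation is enough to show the data flow.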
The session requests text and audio modalities, and output is routed to the `virtual_output` device (BlackHole 2ch).

The other person's English is translated to Spanish so you can understand them.
```
BlackHole 16ch (Zoom outputs B's audio here) -> capture (cpal)
  -> resample to 24kHz -> base64 encode
  -> OpenAI GPT Realtime API (EN -> ES translation)
  -> Bluetooth speaker (you hear)
```
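On the playback side, each `response.audio.delta` event carries base64-encoded PCM16. A minimal Python sketch of the decode step (khala's actual implementation is Rust; the helper name here is illustrative):

```python
import base64
import struct

def decode_audio_delta(b64_payload):
    """Decode a response.audio.delta payload (base64-encoded
    PCM16 little-endian) into a list of i16 samples."""
    raw = base64.b64decode(b64_payload)
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))

# Round-trip a tiny three-sample payload
payload = base64.b64encode(struct.pack("<3h", 0, 1000, -1000)).decode("ascii")
samples = decode_audio_delta(payload)
```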
virtual_input device (BlackHole 16ch)Khala uses a queue model for natural, fluid translation:
- `response.create` is sent immediately after each committed speech segment, rather than waiting for the previous response to finish.
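For one speech segment, the client-side event sequence can be sketched as follows. The event types are real Realtime API events used by khala; the helper itself is an illustrative Python sketch:

```python
import base64
import json

def segment_events(pcm16_chunks):
    """Build the client events for one committed speech segment:
    stream the audio chunks, commit the buffer, then immediately
    request a response (queue model -- no waiting for the previous
    response to complete)."""
    events = []
    for chunk in pcm16_chunks:
        events.append({"type": "input_audio_buffer.append",
                       "audio": base64.b64encode(chunk).decode("ascii")})
    events.append({"type": "input_audio_buffer.commit"})
    events.append({"type": "response.create"})
    return [json.dumps(e) for e in events]

evs = segment_events([b"\x00\x01", b"\x02\x03"])
```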
To prevent the model from drifting into conversational mode, Khala automatically deletes conversation items after each response:
- `conversation.item.created`: new items are recorded as pending
- `response.created`: pending items move to active tracking
- `response.done`: active items plus the response's output items are deleted via `conversation.item.delete`

This keeps the conversation context minimal (only the system prompt), preventing the model from accumulating history and drifting into conversational mode.
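The item lifecycle above can be sketched as a small state tracker. This is an illustrative Python sketch of the bookkeeping, not khala's actual Rust code; the class and method names are assumptions:

```python
class ItemTracker:
    """Tracks conversation items so they can be deleted after each
    response, keeping only the system prompt in context."""

    def __init__(self):
        self.pending = []  # item ids seen via conversation.item.created
        self.active = []   # items covered by the in-flight response

    def on_item_created(self, item_id):
        self.pending.append(item_id)

    def on_response_created(self):
        # Pending items move to active tracking
        self.active.extend(self.pending)
        self.pending.clear()

    def on_response_done(self, output_item_ids):
        # Delete active items plus the response's own output items
        doomed = self.active + list(output_item_ids)
        self.active = []
        return [{"type": "conversation.item.delete", "item_id": i}
                for i in doomed]

tracker = ItemTracker()
tracker.on_item_created("item_in")
tracker.on_response_created()
deletes = tracker.on_response_done(["item_out"])
```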
Khala uses its own Voice Activity Detection instead of the server’s turn detection:
- When no speech is detected for `silence_ms`, the speech segment ends and the audio is committed.
- `turn_detection` is set to `null` to disable server-side VAD entirely.

The manager also starts the `khala-rvc` Python server as a child process (if enabled).

```
khala/
├── Cargo.toml            # Workspace root
├── install.sh            # Installation script
├── uninstall.sh          # Uninstallation script
├── src/
│   ├── main.rs           # CLI dispatch, RVC lifecycle, doctor
│   ├── cli.rs            # Subcommand definitions (clap)
│   ├── config.rs         # TOML config loading
│   └── ui.rs             # TUI dashboard (ratatui)
├── khala-core/           # Library crate
│   └── src/
│       ├── lib.rs        # Public module exports
│       ├── audio.rs      # Capture/playback (cpal)
│       ├── config.rs     # Pipeline config struct
│       ├── metrics.rs    # Lock-free pipeline metrics
│       ├── pipeline.rs   # Pipeline orchestration
│       ├── protocol.rs   # OpenAI Realtime API types
│       ├── rvc.rs        # RVC Unix socket client
│       └── websocket.rs  # WebSocket send/receive
├── khala-config/
│   ├── config.toml       # Default config template
│   └── prompt.txt        # Default translation prompt
└── khala-rvc/            # Python RVC server
    ├── main.py           # Entry point + CLI args
    ├── processor.py      # RvcProcessor (voice conversion)
    ├── server.py         # Asyncio Unix socket server
    ├── macos_compat.py   # Apple Silicon workarounds
    └── requirements.txt  # Python dependencies
```
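The client-side VAD rule described above (a segment ends after `silence_ms` of silence, then the audio is committed) can be sketched as follows. The energy threshold and frame size here are illustrative assumptions, not khala's actual values:

```python
class Vad:
    """Minimal energy-threshold VAD: a speech segment ends after
    silence_ms of consecutive sub-threshold frames, at which point
    the caller should commit the buffered audio."""

    def __init__(self, silence_ms=600, frame_ms=20, threshold=0.01):
        self.silence_frames_needed = silence_ms // frame_ms
        self.threshold = threshold
        self.silent_run = 0
        self.in_speech = False

    def push_frame(self, samples):
        """Feed one frame of f32 samples; return True exactly when
        a segment just ended (time to commit)."""
        energy = sum(s * s for s in samples) / max(len(samples), 1)
        if energy >= self.threshold:
            self.in_speech = True
            self.silent_run = 0
            return False
        if not self.in_speech:
            return False
        self.silent_run += 1
        if self.silent_run >= self.silence_frames_needed:
            self.in_speech = False
            self.silent_run = 0
            return True
        return False
```

With server-side `turn_detection` disabled, a tracker like this is the only thing deciding when `input_audio_buffer.commit` fires.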
Uses the OpenAI Realtime API v1 (beta) over WebSocket:
| Direction | Event | Purpose |
|---|---|---|
| Client -> Server | `session.update` | Configure session (modalities, voice, temperature, noise reduction) |
| Client -> Server | `input_audio_buffer.append` | Stream audio chunks (base64 PCM16) |
| Client -> Server | `input_audio_buffer.commit` | Commit buffered audio for processing |
| Client -> Server | `response.create` | Trigger a translation response |
| Client -> Server | `conversation.item.delete` | Clean up processed items |
| Server -> Client | `response.audio.delta` | Translated audio chunks (base64 PCM16) |
| Server -> Client | `response.text.delta` | Translated text chunks |
| Server -> Client | `response.done` | Response completed |
| Server -> Client | `conversation.item.created` | New item in conversation |
Length-prefixed binary protocol over a Unix socket:

```
[4 bytes: payload length (u32 LE)] [payload: PCM16 i16 LE samples]
```
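The framing above can be sketched in Python (the khala-rvc server side is Python; the helper names here are illustrative):

```python
import struct

def encode_frame(samples):
    """Frame i16 samples for the RVC Unix socket:
    4-byte u32 LE payload length, then PCM16 i16 LE samples."""
    payload = struct.pack("<%dh" % len(samples), *samples)
    return struct.pack("<I", len(payload)) + payload

def decode_frame(buf):
    """Parse one frame from buf; return (samples, bytes_consumed),
    or (None, 0) if buf does not yet hold a complete frame."""
    if len(buf) < 4:
        return None, 0
    (length,) = struct.unpack_from("<I", buf, 0)
    if len(buf) < 4 + length:
        return None, 0
    samples = list(struct.unpack_from("<%dh" % (length // 2), buf, 4))
    return samples, 4 + length
```

The length prefix lets the reader handle partial reads on the stream socket: buffer bytes until a full frame is available, consume it, repeat.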