Building a Personal AI Assistant Like Jarvis: A Complete Technical Guide
Parijat Software Team
Voice AI Expert
If you grew up watching Iron Man, you probably dreamed of having your own Jarvis. An AI that just... gets you. One that listens when you need it, stays quiet when you don't, and actually does useful things instead of just answering trivia questions.
We recently built exactly that. A personal voice AI assistant that runs from the desktop, manages Gmail and Google Calendar through natural conversation, searches the web for real-time information, and responds to wake words like the sci-fi assistants we all imagined.
This project started as a prototype to validate features for Marfa, a personal AI assistant you can access over a phone call. But it quickly became something we use every day. We're open-sourcing the code so you can build your own. Or reach out if you want us to build something custom for your use case.
Here's how it all works under the hood.
The Tech Stack
Building a production-quality voice AI assistant requires getting several pieces right. Here's the stack we chose and why:
Voice AI Framework: LiveKit Agents
LiveKit is the backbone of this project. It's an open-source framework for building real-time voice and video AI agents, the same infrastructure that powers ChatGPT's Advanced Voice Mode.
What makes LiveKit valuable is how it handles the hard parts of voice AI: turn detection, interruption handling, streaming responses, and WebRTC connectivity. This lets you focus on assistant logic instead of audio engineering.
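To show how little glue code that takes, here's a minimal session in the style of the LiveKit Agents quickstart, wiring up the same stack described below. Treat it as a sketch: the instructions string is illustrative and plugin options are trimmed.

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero

async def entrypoint(ctx: agents.JobContext):
    # Wire STT -> LLM -> TTS into one session; LiveKit handles turn
    # detection, interruptions, and streaming between the stages.
    session = AgentSession(
        stt=deepgram.STT(model="nova-2"),
        llm=openai.LLM(model="gpt-4.1"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful personal assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))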
Speech-to-Text: Deepgram Nova-2
For transcription, we're using Deepgram's Nova-2 model. Latency is crucial for conversational AI: anything over roughly 300ms and interactions start feeling sluggish. Nova-2 delivers accurate transcription fast enough for real-time conversation.
Large Language Models: Multi-Provider Fallback
The assistant uses a fallback adapter pattern with multiple LLM providers:
- OpenAI GPT-4.1: Primary model
- Anthropic Claude Haiku: Fast fallback with caching
- Google Gemini 2.5 Flash: Additional redundancy
This architecture provides reliability (if one provider has issues, others take over) and lets us experiment with different models for different interaction types.
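Here's a sketch of the wiring, assuming LiveKit's llm.FallbackAdapter, which takes an ordered list of providers. The model identifiers are illustrative, and the prompt-caching configuration for Claude is omitted.

from livekit.agents import llm
from livekit.plugins import anthropic, google, openai

# Providers are tried in order; if the active one errors or times out,
# the adapter fails over to the next in the list.
fallback_llm = llm.FallbackAdapter([
    openai.LLM(model="gpt-4.1"),                     # primary
    anthropic.LLM(model="claude-3-5-haiku-latest"),  # fast fallback
    google.LLM(model="gemini-2.5-flash"),            # extra redundancy
])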
Text-to-Speech: Cartesia Sonic-3
For voice output, Cartesia's Sonic-3 provides natural-sounding speech with low latency. We've also tested ElevenLabs for projects requiring more voice customization.
Real-Time Web Search: Perplexity API
This is what makes the assistant actually useful day-to-day. When you ask "what's the weather?" or "what happened in the news?", the assistant queries Perplexity's Search API for real-time information. No stale training data, actual current information.
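Under the hood this is a single HTTP call. A minimal sketch of a search helper, assuming Perplexity's OpenAI-compatible chat completions endpoint and its sonar model (check their docs for current model names):

import os
import httpx

def web_search(query: str) -> str:
    # Perplexity's search-grounded models answer from live web results,
    # so the response reflects current information, not training data.
    resp = httpx.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
        json={
            "model": "sonar",
            "messages": [{"role": "user", "content": query}],
        },
        timeout=30.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]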
Gmail & Calendar Integration: Composio
Composio handles OAuth and API integration with Google services; a minimal call sketch follows the list. Through Composio, the assistant can:
- Fetch and search emails
- Send emails and manage drafts
- Create, update, and delete calendar events
- Find free time slots for scheduling
- Read today's agenda
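As a sketch of what a single Composio call looks like, assuming the ComposioToolSet API and Composio's Gmail action names (verify the exact action enums and parameters against the Composio docs):

from composio import Action, ComposioToolSet

toolset = ComposioToolSet()  # reads COMPOSIO_API_KEY from the environment

# Execute one Gmail action directly; the assistant exposes these same
# actions to the LLM as callable tools.
result = toolset.execute_action(
    action=Action.GMAIL_FETCH_EMAILS,
    params={"max_results": 5, "query": "is:unread"},
)
print(result)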
Key Features That Make It Production-Ready
Wake Word Activation
The assistant doesn't continuously process everything it hears. It stays idle until it detects its wake word, configurable to any name you want.
wakeupKeywords = [
    self.assistant_name,
    "hey " + self.assistant_name.lower(),
    "hello " + self.assistant_name.lower(),
]
After ~10 seconds of silence, the assistant enters an "away" state. It's still listening for the wake word, but won't process background audio. This saves compute costs and prevents false activations.
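The away-state logic itself is a small state machine. A simplified, framework-free sketch of the idea (the class and method names are ours for illustration, not the repo's exact code):

import time

class WakeWordGate:
    """Drop transcripts while 'away'; re-engage when a wake word appears."""

    AWAY_AFTER_SECONDS = 10.0

    def __init__(self, wake_words: list[str]):
        self.wake_words = [w.lower() for w in wake_words]
        self.last_activity = time.monotonic()
        self.away = False

    def should_process(self, transcript: str) -> bool:
        now = time.monotonic()
        if now - self.last_activity > self.AWAY_AFTER_SECONDS:
            self.away = True  # ~10s of silence: stop processing audio
        if self.away:
            if any(w in transcript.lower() for w in self.wake_words):
                self.away = False  # wake word heard: wake back up
                self.last_activity = now
                return True
            return False  # still away: ignore background audio
        self.last_activity = now
        return True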
Voice-Controlled Mute
Privacy matters. Say "mute" and the assistant goes completely silent. It won't respond to anything except "unmute" or the wake word. Essential for calls, meetings, or whenever you need it to stay quiet.
Dynamic Tool Loading
This is where the architecture gets interesting. Instead of loading every capability at startup (which bloats the context window and slows responses), the assistant detects user intent and loads only relevant tools on-demand.
# Dynamically load tools based on detected intent
if user_message:
    intent_tools = self.tool_manager.get_tools_for_intent(user_message)
Ask about your calendar? Calendar tools load. Ask about email? Gmail tools load. Tools unused for 3 conversation turns automatically unload. It's like garbage collection for AI capabilities.
The intent detection uses keyword matching against categories:
CALENDAR_KEYWORDS = [
    "calendar", "schedule", "meeting", "appointment",
    "event", "reschedule", "free time", "availability",
]

EMAIL_KEYWORDS = [
    "email", "mail", "inbox", "message", "send",
    "reply", "draft", "unread",
]
Pre-Action Acknowledgment
When executing time-consuming operations (creating events, sending emails, searching), the assistant acknowledges immediately with natural phrases like "Let me handle that" or "Working on it" before the action completes. Small detail, but it makes interactions feel responsive.
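Here's a sketch of how that can look inside a LiveKit function tool, assuming the session's say method for out-of-band speech. The tool, its parameters, and the Composio helper stub are illustrative, not the repo's exact code.

import random

from livekit.agents import RunContext, function_tool

ACKNOWLEDGMENTS = ["Let me handle that.", "Working on it.", "One moment."]

async def create_event_via_composio(title: str, start_iso: str) -> None:
    ...  # stand-in for the Composio calendar call shown earlier

@function_tool
async def create_calendar_event(context: RunContext, title: str, start_iso: str) -> str:
    """Create a calendar event; slow, so acknowledge before doing the work."""
    # Speak an acknowledgment immediately, before the slow API call returns.
    context.session.say(random.choice(ACKNOWLEDGMENTS))
    await create_event_via_composio(title, start_iso)
    return f"Created '{title}' starting at {start_iso}."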
Architecture Overview
voice_ai_assistant/
├── src/
│ ├── agent.py # Core assistant logic
│ ├── services/
│ │ └── perplexity_service.py # Web search integration
│ ├── tools/
│ │ ├── web_search_tools.py # Perplexity wrapper
│ │ ├── composio_tools_dynamic.py # Gmail/Calendar tools
│ │ └── dynamic_tool_manager.py # Intent-based loading
│ └── utils/
│ └── instructions.py # System prompts
├── .env.example
└── pyproject.toml
The Assistant class extends LiveKit's Agent base class, overriding key methods:
- stt_node: Intercepts transcription to handle wake word detection and mute state
- llm_node: Manages dynamic tool loading based on conversation intent
- on_user_turn_completed: Cleans up unused tools after each turn
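In skeleton form it looks roughly like this. Node signatures and delegation details vary across livekit-agents versions, and event_text, last_user_text, wake_gate, and tool_manager are stand-ins for the helpers sketched earlier, so treat this as a shape rather than drop-in code.

from livekit.agents import Agent

class Assistant(Agent):
    async def stt_node(self, audio, model_settings):
        # Run the default STT stage, then gate transcripts on
        # wake-word / mute state before they reach the LLM.
        async for event in super().stt_node(audio, model_settings):
            if self.wake_gate.should_process(event_text(event)):
                yield event

    async def llm_node(self, chat_ctx, tools, model_settings):
        # Replace the tool list with only the intent-matched tools.
        tools = self.tool_manager.get_tools_for_intent(last_user_text(chat_ctx))
        async for chunk in super().llm_node(chat_ctx, tools, model_settings):
            yield chunk

    async def on_user_turn_completed(self, turn_ctx, new_message):
        # Advance the turn counter so stale tool categories unload.
        self.tool_manager.on_turn_completed()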
Getting Started
Prerequisites
- Python 3.10+
- LiveKit account (cloud or self-hosted)
- API keys: Deepgram, OpenAI/Anthropic/Google, Cartesia or ElevenLabs, Perplexity
- Composio account with Gmail and Google Calendar connected
Installation
git clone https://github.com/MarfaAI/voice_ai_assistant.git
cd voice_ai_assistant
uv sync
Configure .env.local based on .env.example:
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_api_key
LIVEKIT_API_SECRET=your_api_secret
COMPOSIO_API_KEY=your_composio_key
PERPLEXITY_API_KEY=your_perplexity_key
Running
python -m src.agent dev
You can find more details on project setup in the repo's README.
Example Interactions
Calendar:
- "What's on my calendar today?"
- "Schedule a meeting with the team tomorrow at 2pm"
- "When am I free this week?"
- "Cancel my 3 o'clock"
Email:
- "Any new emails?"
- "Read my latest email from Sarah"
- "Send an email to the team about the project update"
- "Do I have anything from legal?"
Real-Time Information:
- "What's the weather?"
- "Latest news about AI"
- "Current Bitcoin price"
- "Who won the Lakers game?"
Extending to Phone Calls
This implementation runs as a console app, but adding phone support is straightforward with LiveKit's telephony stack.
The path:
- Provision a phone number (Twilio, Vonage, etc.)
- Configure SIP trunking with LiveKit
- Route incoming calls to your agent
Your personal Jarvis becomes reachable from any phone. We haven't included telephony in the open-source version, but we've implemented this for client projects. Reach out if you need phone-enabled voice AI.
Want Something Like This For Your Business?
This open-source project demonstrates what's possible with modern voice AI infrastructure. But every business has different needs:
- Custom integrations: CRM, ERP, internal tools, databases
- Industry-specific knowledge: Healthcare, legal, finance, real estate
- Multi-language support: Agents that work across languages
- Phone/SMS channels: Customer-facing voice bots
- Compliance requirements: HIPAA, SOC2, data residency
We build production voice AI systems. From prototype to deployment, we handle the complexity so you get a solution that actually works.
Check out Marfa to try a personal AI assistant over the phone, or get in touch to discuss your project.
