Building a Real-Time Voice Assistant: Integrating Speech Recognition, Text-to-Speech, and LLMs
Creating a responsive, real-time AI voice assistant has become increasingly accessible with modern libraries and APIs. In this blog, we’ll explore a Python implementation that merges speech recognition, text-to-speech (TTS), and a custom Large Language Model (LLM) prompt to bring a Jarvis-inspired assistant to life.
Overview of the System
This assistant listens to user input, processes the speech into text, interacts with an LLM, and replies both audibly and textually. The assistant, named “Jarvis,” is designed to mimic the persona from the Iron Man universe — polite, witty, and succinct.
Key Components
The implementation combines several technologies:
- Speech Recognition: Using the speech_recognition library to transcribe user speech.
- Text-to-Speech (TTS): Utilizing pyttsx3 for real-time audio responses.
- Custom AI Prompting: Employing LangChain and Ollama for generating intelligent, context-aware responses.
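A quick note on setup: assuming the standard PyPI package names, everything can be installed with pip install SpeechRecognition pyttsx3 python-dotenv langchain-core langchain-ollama, plus PyAudio, which sr.Microphone depends on for microphone access. You will also need a local Ollama server running with the llama3.1 model already pulled.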
Let’s dive into the code, step-by-step.
1. Setting the Foundation
First, the necessary libraries are imported, and environment variables are loaded using dotenv to manage sensitive credentials.
import speech_recognition as sr
import pyttsx3
import os
from dotenv import load_dotenv

load_dotenv()  # load variables from a local .env file into the environment
This ensures a secure and modular setup for projects involving API keys or sensitive configurations.
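For example, a key stored in a .env file becomes available through os.getenv (a minimal sketch; the file contents and variable name here are purely illustrative):

# .env (kept out of version control)
# SOME_API_KEY=your-secret-value

import os
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv("SOME_API_KEY")  # returns None if the variable is missing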
2. Crafting Jarvis’ Personality
The assistant’s behavior is shaped using a ChatPromptTemplate, ensuring that responses stay in character.
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
template = """
You are Jarvis, Charan's personal AI assistant from the Iron Man universe.
Your style is polite, witty, and succinct.
You address the user respectfully as "Sir," or by name if provided.
You add subtle humor where appropriate, and you always stay in character as a resourceful AI.
Keep the responses short and to the point, and avoid overly verbose or complex replies.
Context / Conversation so far:
{history}
User just said: {question}
Now, Jarvis, please reply:
"""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3.1")
chain = prompt | model
This defines a “Jarvis persona” and sets up the LLM to generate responses that take the conversation history into account.
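As a quick sanity check before wiring up any audio, the chain can be invoked directly with an empty history (a minimal sketch; the question text is just an example):

# One-off test of the persona chain
reply = chain.invoke({"history": "", "question": "Jarvis, are you online?"})
print(reply)  # OllamaLLM returns the reply as a plain string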
3. Speech Synthesis with pyttsx3
To ensure the assistant responds audibly, the following function initializes a TTS engine:
def SpeakText(text):
    engine = pyttsx3.init()
    voices = engine.getProperty("voices")  # available system voices
    engine.setProperty("rate", 180)        # adjust speed (words per minute)
    engine.setProperty("volume", 1.0)      # max volume
    engine.say(text)
    engine.runAndWait()                    # block until speech finishes
Custom voice selection and tuning options (e.g., speed and volume) allow for tailoring Jarvis’ voice.
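For instance, a different system voice can be selected by its id (a small sketch; which voices exist depends on the operating system and its installed speech engines):

engine = pyttsx3.init()
for voice in engine.getProperty("voices"):
    print(voice.id, voice.name)  # inspect what is installed
# Pick one of the printed ids, e.g. the first voice:
engine.setProperty("voice", engine.getProperty("voices")[0].id)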
4. Speech Recognition with speech_recognition
The following code block continuously listens for user input and converts it to text:
r = sr.Recognizer()  # shared recognizer instance used below

def record_text():
    while True:
        try:
            with sr.Microphone() as source2:
                # Calibrate for background noise before listening
                r.adjust_for_ambient_noise(source2, duration=0.2)
                print("Listening...")
                audio2 = r.listen(source2)
                print("Recognizing...")
                MyText = r.recognize_google(audio2)
                MyText = MyText.lower()
                print("You said:", MyText)
                return MyText
        except sr.RequestError as e:
            print(f"Could not request results; {e}")
        except sr.UnknownValueError:
            print("Could not understand the audio, please try again.")
This function leverages Google’s speech recognition API to provide accurate transcriptions. The adjust_for_ambient_noise method improves recognition in noisy environments.
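If the assistant should not block forever waiting for speech, listen also accepts timeout and phrase-length limits (a small sketch; the five- and ten-second values are arbitrary choices):

with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=0.2)
    try:
        # Give up if no speech starts within 5 s; cap a phrase at 10 s
        audio = r.listen(source, timeout=5, phrase_time_limit=10)
    except sr.WaitTimeoutError:
        print("No speech detected.")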
5. Maintaining Context with Conversation History
To provide context-aware responses, a message history is maintained:
messages = []
messages.append({"role": "system", "content": "You are Jarvis, Charan's AI."})

def send_to_jarvis(message):
    # Flatten the message list into a plain-text transcript for the prompt
    history_text = "".join(
        f"{msg['role'].title()}: {msg['content']}\n" for msg in messages
    )
    response = chain.invoke({"history": history_text, "question": message})
    return response
Because each call receives the full transcript, follow-up questions can refer back to earlier turns, keeping the interaction consistent.
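One caveat: the transcript grows without bound, which eventually inflates the prompt. A minimal fix (the 20-message window here is an arbitrary choice) is to keep only the system message plus the most recent turns:

MAX_TURNS = 20  # arbitrary window size

def trim_history():
    # Keep the system message plus the last MAX_TURNS user/assistant messages
    system, rest = messages[:1], messages[1:]
    messages[:] = system + rest[-MAX_TURNS:]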
6. The Main Loop
The application continuously listens for input, processes it, and responds:
while True:
    user_message = record_text()  # get user speech as text
    messages.append({"role": "user", "content": user_message})
    jarvis_response = send_to_jarvis(user_message)
    messages.append({"role": "assistant", "content": jarvis_response})
    print("Jarvis:", jarvis_response)
    SpeakText(jarvis_response)
This loop creates an interactive experience where the user can naturally converse with Jarvis.
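To exit cleanly rather than killing the process, the loop can watch for a stop phrase (a sketch; the word "goodbye" is just an example trigger):

while True:
    user_message = record_text()
    if "goodbye" in user_message:  # example stop phrase
        SpeakText("Goodbye, Sir.")
        break
    messages.append({"role": "user", "content": user_message})
    jarvis_response = send_to_jarvis(user_message)
    messages.append({"role": "assistant", "content": jarvis_response})
    print("Jarvis:", jarvis_response)
    SpeakText(jarvis_response)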
Enhancements and Use Cases
- Customizable Persona: Modify the prompt to create assistants with unique personalities or expertise.
- Environment-Aware Responses: Add modules to fetch weather, control IoT devices, or retrieve contextual information (see the sketch after this list).
- Multi-User Support: Enhance the system to differentiate between users and provide personalized responses.
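As a taste of environment-aware behavior, simple intents can be intercepted before the message ever reaches the LLM (a minimal sketch; the keyword check and the time intent are purely illustrative):

from datetime import datetime

def handle_locally(message):
    # Return a canned answer for known intents, or None to defer to the LLM
    if "time" in message:
        return datetime.now().strftime("It is %I:%M %p, Sir.")
    return None

# In the main loop:
# reply = handle_locally(user_message) or send_to_jarvis(user_message)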
Conclusion
This project highlights how readily available libraries can be combined to create an intelligent and interactive voice assistant. With tools like LangChain, Ollama, and speech_recognition, developers can focus on fine-tuning user experiences without building complex NLP systems from scratch.
Try building your own AI assistant today, and let us know how you enhanced it!