Building a Real-Time Voice Assistant: Integrating Speech Recognition, Text-to-Speech, and LLMs
Creating a responsive, real-time AI voice assistant has become increasingly accessible with modern libraries and APIs. In this blog, we’ll explore a Python implementation that merges speech recognition, text-to-speech (TTS), and a custom Large Language Model (LLM) prompt to bring a Jarvis-inspired assistant to life.
Overview of the System
This assistant listens to user input, processes the speech into text, interacts with an LLM, and replies both audibly and textually. The assistant, named “Jarvis,” is designed to mimic the persona from the Iron Man universe — polite, witty, and succinct.
Key Components
The implementation combines several technologies:
- Speech Recognition: Using the speech_recognition library to transcribe user speech.
- Text-to-Speech (TTS): Utilizing pyttsx3 for real-time audio responses.
- Custom AI Prompting: Employing LangChain and Ollama for generating intelligent, context-aware responses.
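A quick note on setup: assuming the standard PyPI package names, everything can be installed with pip install SpeechRecognition pyttsx3 python-dotenv langchain-core langchain-ollama, plus PyAudio, which sr.Microphone depends on for microphone access. You will also need a local Ollama server running with the llama3.1 model already pulled.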
Let’s dive into the code, step-by-step.
1. Setting the Foundation
First, the necessary libraries are imported, and environment variables are loaded using dotenv to manage sensitive credentials.
import speech_recognition as sr
import pyttsx3
import os
from dotenv import load_dotenv

load_dotenv()  # load variables from a local .env file into the environment
This ensures a secure and modular setup for projects involving API keys or sensitive configurations.
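For example, a key stored in a .env file becomes available through os.getenv (a minimal sketch; the file contents and variable name here are purely illustrative):

# .env (kept out of version control)
# SOME_API_KEY=your-secret-value

import os
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv("SOME_API_KEY")  # returns None if the variable is missing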
2. Crafting Jarvis’ Personality
The assistant’s behavior is shaped using a ChatPromptTemplate, ensuring that responses stay in character.
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
template = """
You are Jarvis, Charan's personal AI assistant from the Iron Man universe.
Your style is polite, witty, and succinct.
You address the user respectfully as "Sir," or by name if provided.
You add subtle humor where appropriate, and you always stay in character as a resourceful AI.
Keep the responses short and to the point, and avoid overly verbose or complex replies.
Context / Conversation so far:
{history}
User just said: {question}
Now, Jarvis, please reply:
"""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3.1")
chain = prompt | model
This defines a “Jarvis persona” and sets up the LLM to generate responses that take the conversation history into account.
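As a quick sanity check before wiring up any audio, the chain can be invoked directly with an empty history (a minimal sketch; the question text is just an example):

# One-off test of the persona chain
reply = chain.invoke({"history": "", "question": "Jarvis, are you online?"})
print(reply)  # OllamaLLM returns the reply as a plain string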
3. Speech Synthesis with pyttsx3
To ensure the assistant responds audibly, the following function initializes a TTS engine:
def SpeakText(text):
    engine = pyttsx3.init()
    voices = engine.getProperty("voices")  # available system voices
    engine.setProperty("rate", 180)        # adjust speed (words per minute)
    engine.setProperty("volume", 1.0)      # max volume
    engine.say(text)
    engine.runAndWait()                    # block until speech finishes
Custom voice selection and tuning options (e.g., speed and volume) allow for tailoring Jarvis’ voice.
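For instance, a different system voice can be selected by its id (a small sketch; which voices exist depends on the operating system and its installed speech engines):

engine = pyttsx3.init()
for voice in engine.getProperty("voices"):
    print(voice.id, voice.name)  # inspect what is installed
# Pick one of the printed ids, e.g. the first voice:
engine.setProperty("voice", engine.getProperty("voices")[0].id)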
4. Speech Recognition with speech_recognition
The following code block continuously listens for user input and converts it to text:
r = sr.Recognizer()  # shared recognizer instance used below

def record_text():
    while True:
        try:
            with sr.Microphone() as source2:
                # Calibrate for background noise before listening
                r.adjust_for_ambient_noise(source2, duration=0.2)
                print("Listening...")
                audio2 = r.listen(source2)
                print("Recognizing...")
                MyText = r.recognize_google(audio2)
                MyText = MyText.lower()
                print("You said:", MyText)
                return MyText
        except sr.RequestError as e:
            print(f"Could not request results; {e}")
        except sr.UnknownValueError:
            print("Could not understand the audio, please try again.")
This function leverages Google’s speech recognition API to provide accurate transcriptions. The adjust_for_ambient_noise method improves recognition in noisy environments.
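If the assistant should not block forever waiting for speech, listen also accepts timeout and phrase-length limits (a small sketch; the five- and ten-second values are arbitrary choices):

with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=0.2)
    try:
        # Give up if no speech starts within 5 s; cap a phrase at 10 s
        audio = r.listen(source, timeout=5, phrase_time_limit=10)
    except sr.WaitTimeoutError:
        print("No speech detected.")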
5. Maintaining Context with Conversation History
To provide context-aware responses, a message history is maintained:
messages = []
messages.append({"role": "system", "content": "You are Jarvis, Charan's AI."})

def send_to_jarvis(message):
    # Flatten the message list into a plain-text transcript for the prompt
    history_text = "".join(
        f"{msg['role'].title()}: {msg['content']}\n" for msg in messages
    )
    response = chain.invoke({"history": history_text, "question": message})
    return response
Because each call receives the full transcript, follow-up questions can refer back to earlier turns, keeping the interaction consistent.
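One caveat: the transcript grows without bound, which eventually inflates the prompt. A minimal fix (the 20-message window here is an arbitrary choice) is to keep only the system message plus the most recent turns:

MAX_TURNS = 20  # arbitrary window size

def trim_history():
    # Keep the system message plus the last MAX_TURNS user/assistant messages
    system, rest = messages[:1], messages[1:]
    messages[:] = system + rest[-MAX_TURNS:]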
6. The Main Loop
The application continuously listens for input, processes it, and responds:
while True:
    user_message = record_text()  # get user speech as text
    messages.append({"role": "user", "content": user_message})
    jarvis_response = send_to_jarvis(user_message)
    messages.append({"role": "assistant", "content": jarvis_response})
    print("Jarvis:", jarvis_response)
    SpeakText(jarvis_response)
This loop creates an interactive experience where the user can naturally converse with Jarvis.
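To exit cleanly rather than killing the process, the loop can watch for a stop phrase (a sketch; the word "goodbye" is just an example trigger):

while True:
    user_message = record_text()
    if "goodbye" in user_message:  # example stop phrase
        SpeakText("Goodbye, Sir.")
        break
    messages.append({"role": "user", "content": user_message})
    jarvis_response = send_to_jarvis(user_message)
    messages.append({"role": "assistant", "content": jarvis_response})
    print("Jarvis:", jarvis_response)
    SpeakText(jarvis_response)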
Enhancements and Use Cases
- Customizable Persona: Modify the prompt to create assistants with unique personalities or expertise.
- Environment-Aware Responses: Add modules to fetch weather, control IoT devices, or retrieve contextual information (see the sketch after this list).
- Multi-User Support: Enhance the system to differentiate between users and provide personalized responses.
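As a taste of environment-aware behavior, simple intents can be intercepted before the message ever reaches the LLM (a minimal sketch; the keyword check and the time intent are purely illustrative):

from datetime import datetime

def handle_locally(message):
    # Return a canned answer for known intents, or None to defer to the LLM
    if "time" in message:
        return datetime.now().strftime("It is %I:%M %p, Sir.")
    return None

# In the main loop:
# reply = handle_locally(user_message) or send_to_jarvis(user_message)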
Conclusion
This project highlights how readily available libraries can be combined to create an intelligent and interactive voice assistant. With tools like LangChain, Ollama, and speech_recognition, developers can focus on fine-tuning user experiences without building complex NLP systems from scratch.
Try building your own AI assistant today, and let us know how you enhanced it!