Build a Live AI Voice Agent with Gemini 3.1 Flash Preview and Agora

Real-time voice AI has moved well beyond demos—it’s now production-ready.

With Google’s Gemini 3.1 Flash Live Preview and Agora’s global real-time network (powering over 80 billion minutes of audio and video every month), developers can build low-latency, multilingual voice agents faster than ever.

In this guide, you’ll learn how to wire up a fully functional voice agent in minutes—and see how the same architecture scales to real-world use cases like robotics and conversational commerce.

What you’ll build

By the end of this tutorial, you’ll have a live voice agent that can:

  • Understand and respond to speech in real time
  • Switch seamlessly between multiple languages mid-conversation
  • Generate natural audio responses
  • Invoke tools and external actions dynamically
  • Run on Agora’s real-time infrastructure

We’ll also highlight two real-world demos:

  • A physical robot interface
  • A voice-powered food ordering kiosk

Prerequisites

Before getting started, make sure you have:

  • Node.js installed
  • An Agora account (for App ID + Certificate)
  • A Google AI Studio account (for Gemini API key)

Step 1: Clone the Agent Quickstart

Start with Agora’s agent quickstart repo. Clone it and move into the project directory:

git clone https://github.com/AgoraIO-Conversational-AI/agent-quickstart-nextjs

cd agent-quickstart-nextjs

Open the project in your preferred editor (VS Code, Cursor, etc.).

Step 2: Configure your environment

Copy the example env file to create your own:

cp .env.example .env.local

You’ll need three values inside it:

AGORA_APP_ID=<your-app-id>
AGORA_APP_CERTIFICATE=<your-certificate>
GEMINI_API_KEY=<your-gemini-key>

To get your Agora credentials, sign in to the Agora Console, create a new project, click Configure, and copy your App ID and Primary Certificate. While you’re there, enable the Conversational AI feature for your project — one click, then confirm. Your Gemini API key comes from Google AI Studio. Keep it out of version control.
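Because a missing key fails in confusing ways at runtime, it can help to fail fast instead. Here is a minimal sketch (not part of the quickstart) of a guard you could add near the top of app/api/invite-agent/route.js, using the variable names from .env.local above:

// Fail fast if any required environment variable is missing.
const requiredEnv = ['AGORA_APP_ID', 'AGORA_APP_CERTIFICATE', 'GEMINI_API_KEY'];
for (const key of requiredEnv) {
  if (!process.env[key]) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
}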

Step 3: Swap in Gemini Live

Open app/api/invite-agent/route.js. By default, the route is configured for a chained pipeline (speech-to-text → LLM → text-to-speech). For Gemini 3.1 Flash Live Preview, you replace that entire chain with a single native multimodal call.

Import the Gemini Live module at the top of the file, remove the three pipeline steps, and replace them with:

.withMllm(new GeminiLive({
  model: 'gemini-3.1-flash-live-preview',
  apiKey: process.env.GEMINI_API_KEY ?? '',
  url: `wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key=${process.env.GEMINI_API_KEY ?? ''}`,
  inputModalities: ['audio'],
  outputModalities: ['audio'],
  failureMessage: 'Sorry, I encountered an issue. Please try again.',
  instructions: ADA_PROMPT,
  voice: 'Charon',
  additionalParams: {
    affective_dialog: false,
    proactive_audio: false,
    transcribe_agent: true,
    transcribe_user: true,
    http_options: {
      api_version: 'v1beta',
    },
  },
}))

The system prompt (ADA_PROMPT) and the greeting string are defined as variables earlier in the file; just reference them here.
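If you are wiring this up from scratch instead, here is a minimal sketch of those variables. The values (and the GREETING name) are illustrative assumptions, not the quickstart's actual definitions; ADA_PROMPT is the identifier the snippet above already references.

// Illustrative placeholders; the quickstart ships its own definitions.
const ADA_PROMPT =
  'You are Ada, a concise, friendly voice assistant. Reply in the language the user speaks.';
const GREETING = 'Hi, I am Ada. How can I help you today?';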

Step 4: Run it

Install dependencies, then start the dev server:

npm install
npm run dev

Navigate to http://localhost:3000, click Try it now, and speak. That's it.

See it in action

The quickest way to appreciate what the model can do is to throw multilingual requests at it mid-conversation. In our own testing, the agent handled seamless language switches without missing a beat — English to German to French to Chinese, all within a single conversation, with no reconfiguration required.

Beyond Chat: Tool Calling and Physical Interfaces

This architecture isn’t limited to voice chat.

Robotics Demo

We integrated the agent with a Reachy Mini robot, enabling:

  • 70+ callable “emotes” mapped to motor controls
  • Real-time conversational control of physical actions
  • Dynamic tool selection by the model

The result: a system where conversation directly drives physical behavior—no manual routing required.
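To make the tool-calling piece concrete, here is a minimal sketch of how a single emote might be exposed to the model as a callable function. The declaration shape loosely follows Gemini's function-calling schema, but the names, handler, and sendToRobot transport are assumptions for illustration, not the demo's actual code.

// Hypothetical tool declaration: lets the model trigger a named emote.
const playEmoteTool = {
  name: 'play_emote',
  description: 'Trigger a named emote animation on the robot',
  parameters: {
    type: 'object',
    properties: {
      emote: { type: 'string', description: 'Emote identifier, e.g. "wave" or "nod"' },
    },
    required: ['emote'],
  },
};

// Stand-in for whatever motor-control transport the robot uses (assumption).
async function sendToRobot(command) {
  console.log('robot command:', command);
}

// Hypothetical handler invoked when the model calls the tool.
async function handlePlayEmote({ emote }) {
  await sendToRobot({ action: 'emote', id: emote });
  return { status: 'ok', played: emote };
}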

Note: Hardware introduces latency. For ultra-low-latency applications, software-only deployments perform best.

Voice-Powered Food Ordering Demo

We also built a conversational kiosk that can:

  • Manage a live cart
  • Suggest menu items
  • Handle substitutions and removals
  • Track complex order changes in real time

In testing, users:

  • Swapped items
  • Added desserts
  • Changed orders mid-conversation

…and the agent maintained full state accuracy throughout.
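To give a feel for how that state tracking can work, here is a minimal sketch of a cart plus the handlers the model's tool calls would hit. Everything below is an illustrative assumption, not the kiosk's actual code.

// Hypothetical in-memory cart the model manipulates through tool calls.
const cart = new Map(); // item name -> quantity

function addItem({ item, quantity = 1 }) {
  cart.set(item, (cart.get(item) ?? 0) + quantity);
  return { cart: Object.fromEntries(cart) };
}

function removeItem({ item }) {
  cart.delete(item);
  return { cart: Object.fromEntries(cart) };
}

function swapItem({ from, to }) {
  const quantity = cart.get(from) ?? 1;
  cart.delete(from);
  return addItem({ item: to, quantity });
}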

Why Agora for Voice AI?

Low-latency voice AI depends heavily on the transport layer.

Agora handles:

  • Global packet routing
  • Jitter buffering
  • Codec negotiation
  • Real-time synchronization

So you can focus on building your agent—not infrastructure.

With SDKs and APIs across web, mobile, and embedded platforms, you can deploy the same backend everywhere.
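On the web, for example, the user side of the conversation is a few lines with Agora's Web SDK (agora-rtc-sdk-ng). The function below is a sketch; the channel name, token handling, and codec choice are assumptions to adapt to your setup.

// Sketch: join the same channel as the agent and publish the microphone.
import AgoraRTC from 'agora-rtc-sdk-ng';

async function joinAndTalk(appId, channel, token) {
  const client = AgoraRTC.createClient({ mode: 'rtc', codec: 'vp8' });

  // Play the agent's audio as soon as it publishes to the channel.
  client.on('user-published', async (user, mediaType) => {
    await client.subscribe(user, mediaType);
    if (mediaType === 'audio') user.audioTrack.play();
  });

  await client.join(appId, channel, token, null);
  const micTrack = await AgoraRTC.createMicrophoneAudioTrack();
  await client.publish([micTrack]);
}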

Get Started

Try Agora for Free

Sign up and start building! You don’t pay until you scale.