Even if you’re not running HA-OS or HA-Supervised

Did you try installing, configuring and activating the Voice Assistant in Home Assistant? Most people give up after getting the “Sorry, I didn’t understand that!” response, no matter what they do. And trying to search online to figure out how to solve this and get started? It’s a mess.

But it is possible to get a working result using the voice components included with Home Assistant, although the result is neither very useful nor particularly user-friendly unless you run HA-OS and can take advantage of the Nabu Casa offering (cloud voice components).

But wait, there’s hope. In this tutorial we will walk the road starting with a very dumb, unresponsive and useless implementation, by following the official guides. Why waste time on this? Because it is highly educational! In order to become a 400m sprint champion, you have to start slow. First crawl, then walk, then run - you get the idea.

In this guide I will take you all the way up to a fully intelligent, AI-based conversation integration that is breathtakingly fast, smart and flexible. And without the need to spend a dime, unless you want to.

I’ll show you how!

This guide assumes that you are running Home Assistant Core in a Docker container, and that you know something about how to create, configure and run containers. I’m running everything on a QNAP NAS server, which has its own Container Station (CS) application capable of running any Docker container image pulled from Docker Hub.
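If you don’t already have Home Assistant Core running in a container, a minimal docker-compose sketch could look like the one below. The image tag is the official one on Docker Hub; the timezone and volume path are just examples from my kind of setup, so adjust them to yours:

version: '3'

services:
  homeassistant:
    image: homeassistant/home-assistant:stable
    network_mode: host                 # host networking is needed for device discovery (mDNS etc.)
    environment:
      TZ: Europe/Oslo                  # example timezone, use your own
    volumes:
      - /share/Public/HomeAssistant:/config   # example path on a QNAP share
    restart: unless-stopped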

Container images can be managed in three different ways, in any combination (a command-line comparison follows the list):

- directly from the command line, with docker commands
- interactively, through a GUI such as QNAP Container Station
- declaratively, with docker-compose YAML files

We’ll be using the last of these, as it offers maximum flexibility and ease of use along with a reasonably friendly graphical user interface (GUI) when used with QNAP CS.
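For comparison, a one-off command-line deployment of a container looks roughly like this (the image name and paths here are placeholders, not something you need to run):

docker run -d --name myapp \
  --network host \
  -e TZ=Europe/Oslo \
  -v /share/Public/MyApp:/config \
  some/image:latest

This works, but the whole configuration lives in your shell history instead of in a file you can keep, tweak and re-run, which is exactly what docker-compose gives you.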

A successful implementation depends on your demands for voice quality, responsiveness and speed. If the hardware running your containers and HA itself is a Raspberry Pi, the result is going to be fairly poor. Even a medium-powered NAS isn’t able to do Speech-to-Text (STT) and Text-to-Speech (TTS) conversions in near real time. So adjust your expectations! If you want true real-time response, like you get with Alexa and Google, then you have to run your containers on a NAS or PC with a high-powered GPU card, or use the available cloud-based TTS and STT engines. But there are alternatives, which I will explain shortly. Read on!

First things first

In order to enjoy voice assistants, you need something like an Alexa or Google Home pod. Sitting in front of a PC, talking to its microphone (if it has one) and listening to the built-in speaker isn’t very practical. You want some kind of stand-alone device capable of offering more or less the same as Alexa and Google do. It isn’t mandatory, though: if sitting in front of your PC is good enough for you, you can enjoy this tutorial anyway. I will be using a development kit called the ESP32 S3 BOX in this tutorial.

[Image: the ESP32 S3 BOX development kit]

You can get this device online for around $45 from various dealers. I strongly recommend getting one, but you don’t have to. The ESP32 S3 BOX is an AIoT device built around the ESP32-S3 chip, with a small color screen, a very poor speaker, and a remarkably good microphone - which is the most important component. We are going to flash it with a custom firmware for voice recognition. Part of this process (later on) requires the use of the ESPHome Dashboard. This software is typically installed as an Add-on if you are running HA-OS, which of course you are not. Therefore you can begin by installing ESPHome in a dedicated container, using this image:

esphome/esphome

A working docker-compose YAML file for this could be:

version: '3'

services:
  esphome:                         # this service runs ESPHome, not HA, so name it accordingly
    image: esphome/esphome:latest
    network_mode: host             # lets the dashboard discover ESP devices via mDNS
    environment:
      LANG: nb_NO.UTF-8
      TZ: Europe/Oslo
    volumes:
      - /share/Public/ESPHome:/config    # device configurations are stored here
    restart: unless-stopped        # come back up automatically after a NAS reboot
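Save this as docker-compose.yml (in QNAP CS you should be able to paste it straight into the application creation dialog), bring it up, and the ESPHome Dashboard will be listening on its default port 6052:

docker-compose up -d
# then open http://<NAS-IP>:6052 in a browser

The /share/Public/ESPHome path is simply where I keep my configs on the NAS; point the volume anywhere you like, as long as it is mapped to /config inside the container.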