Last year, the leaders of Home Assistant declared 2023 the “Year of the Voice.” The goal was to let users of the DIY home automation platform “control Home Assistant in their own language.” It was a bold call, given the expectations people bring from using Alexa and the like. Further, the Home Assistant team wasn’t even sure where to start.
Did they succeed, looking in from early 2024? In a very strict sense, yes. Right now, with some off-the-shelf gear and the patience to flash and fiddle, you can ask “Nabu” or “Jarvis” or any name you want to turn off some lights, set the thermostat, or run automations. And you can ask about the weather. Narrowly defined mission: Accomplished.
In a broader, more accurate sense, Home Assistant voice control has a ways to go. Your verb set is limited to toggling, setting, and other smart home interactions. The easiest devices to use for this don’t have the best noise cancellation or pick-up range. Errors aren’t handled gracefully, and you get the best results by fine-tuning the names you call everything you control.
It’s not entirely fair to compare locally run, privacy-minded voice control to the “assistants” offered by globe-spanning tech companies with secondary motives. Paulus Schoutsen, founder of Home Assistant, knows this, but he’s motivated to keep improving anyway. Schoutsen told Ars that people tend to arrive at Home Assistant after starting out with one of the big three: Amazon’s Alexa, Google’s Assistant, or Apple’s Siri. “They’re ‘outgrowers,’ so they come to us. We’re their second system,” Schoutsen said.
While outgrowers are happy to leave behind the inconsistent behavior, privacy concerns, or limitations of their old systems, they can miss being able to just shout from anywhere in a room and have a device figure out their intent. Or, failing that, their kids want music to play when they say, “Play ‘Cruel Summer’ by Taylor Swift” in the kitchen. Home Assistant is not there yet, and in some ways is not meant to be that kind of system, at least by default. But it’s improving, and it has come a very long way.
Here’s a look at what you can do today with your human voice and Home Assistant, what remains to be fixed and made easier, and how it got here.
The open source to-do list
“As it stands today, we’re not ready yet to tell people that our voice assistant is a replacement for Google/Amazon,” Schoutsen wrote. “We don’t have to be as good as their systems, but there is a certain bar of usable that we haven’t reached yet.”
Key among the improvements that need to happen, according to Schoutsen:
- Audio input needs to be cleaned up (speaker voice separated) before it is processed
- Error messages need to be clearer about what’s going wrong, and input needs more flexibility
- Non-English languages need a lot of commands and variables
- Compatible hardware that features far-listening microphones has to be more widely available
- Most people will want local processing to be faster
All that said, it’s impressive how far Home Assistant has come since late 2022, when it made its pronouncement, despite not really having a clear path toward its end goal.
How Home Assistant built a voice framework out of parts
It seemed wildly ambitious for Home Assistant to take on the challenge of making a voice assistant that wasn’t tied to the globe-spanning server power of a tech giant. At the same time, it made more sense for Home Assistant than almost any other project. Home Assistant is, by some measures, the second-most active open source project on GitHub. Other open voice assistants have cropped up over the years, but being tied to no particular thing to assist, they have tended to fade away.
“Voice itself is just an interface. It has to be an interface to something,” Schoutsen said. “We’re in a unique position since we’re integrated with all the different services in your home, your life—that’s what you want the voice to control… We just didn’t know if it was going to actually work.”
Late in 2022, OpenAI released Whisper, an open source speech recognition system trained on a large, multilingual, accent-friendly dataset. It was a start, but the software was far too heavy to run on a Raspberry Pi. The same was true of David Scripka’s OpenWakeWord, which was initially too big to fit on tiny ESP32-based gadgets but could run on a Pi or other single-board computer running Home Assistant.
Nabu Casa, which provides cloud services to Home Assistant and helps fund it, hired Mike Hansen, founder of the Rhasspy project, to mold these pieces together and build some new ones. Once there, Hansen developed Piper, a fully local neural-net-based text-to-speech engine. Piper, in turn, could generate the training data needed to bring OpenWakeWord up to a Home Assistant level: tens of thousands of speech samples, with simulated microphone distances and background noise.
By the end of 2023, Home Assistant and Nabu Casa had managed to tie together the four basic elements of a voice system: wake words, speech-to-text, command processing, and text-to-speech. There has been further work, aided by Scripka and Kevin Ahrendt, to make even more of this stack run locally. Today, without having to clone or compile any GitHub projects, you can run a local or cloud-based voice assistant to control your home.
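Those four stages chain together in sequence. Here is a minimal Python sketch of that flow, with stand-in functions for each stage; the function names and the toy matching logic are illustrative only, not Home Assistant's actual APIs:

```python
def detect_wake_word(frames):
    # Stand-in for OpenWakeWord: watch the audio stream for the wake phrase.
    return any("hey jarvis" in f for f in frames)

def speech_to_text(frames):
    # Stand-in for Whisper: transcribe the words that follow the wake phrase.
    for f in frames:
        if "hey jarvis" in f:
            return f.split("hey jarvis", 1)[1].strip()
    return ""

def handle_intent(text):
    # Stand-in for the command processor: match against a fixed set of phrases.
    commands = {
        "turn off the office lights": "Turned off the office lights",
        "turn on the office lights": "Turned on the office lights",
    }
    return commands.get(text, "Sorry, I couldn't understand that")

def text_to_speech(reply):
    # Stand-in for Piper: a real pipeline would synthesize audio here.
    return f"[spoken] {reply}"

def run_pipeline(frames):
    # Wake word -> speech-to-text -> intent -> text-to-speech.
    if not detect_wake_word(frames):
        return None
    return text_to_speech(handle_intent(speech_to_text(frames)))
```

The exact-phrase dictionary in `handle_intent` is also a fair caricature of why phrasing matters so much today: anything outside the known command set falls through to an error reply.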
So how does it work?
What it’s like using Home Assistant’s DIY voice assistant
Getting the most out of voice commands for Home Assistant requires putting some time into your Home Assistant setup and knowing it well. This is largely true for the big-name voice ecosystems, too, but more so here because there’s not as much room for error or guessing.
For example, I have “Kevin Office” as a room in Home Assistant. When I put a smart light switch in that room, I named it “Kevin Office Lights.” At first, when I would say, “Hey Jarvis, turn off Kevin Office Lights,” it was a toss-up as to whether the system would turn off the overhead lights, turn off those lights and the little lamp with a Hue bulb in the corner, or just tell me it couldn’t understand. I have since renamed things for better response, but the process forces you to think through your whole naming scheme.
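The ambiguity I ran into is essentially a naming problem: when a spoken phrase overlaps the names of several entities (or an area name), a naive matcher can't tell whether you mean one device or all of them. A toy illustration, using my entity names and deliberately simplified matching logic (this is not Home Assistant's actual algorithm):

```python
def matching_entities(spoken_name, entities):
    # Naive matcher: an entity matches if either name contains the other.
    spoken = spoken_name.lower()
    return [e for e in entities
            if spoken in e.lower() or e.lower() in spoken]

entities = ["Kevin Office Lights", "Kevin Office Lamp"]

# The full name singles out the switch.
exact = matching_entities("Kevin Office Lights", entities)

# But the room name alone matches both devices, so the system
# has to guess whether you mean one light or the whole area.
ambiguous = matching_entities("Kevin Office", entities)
```

Renaming entities so that no device name is a prefix of another (and none collides with an area name) is the low-tech fix that improved my hit rate.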
I’m using two devices as voice assistants: a $13.50 Atom Echo Smart Speaker kit from M5Stack and a $50 ESP32-S3-Box-3. Both are surprisingly simple to set up as voice assistants, using a USB-C cable and ESPHome in a Chrome browser. My Home Assistant instance is running off a Raspberry Pi 4, a common setup, but there’s an important consideration we’ll get into momentarily.
[Embedded videos: “Box 3, turn on the lights” and “Box 3, turn off the lights”]
I set up the tiny Atom as a push-to-talk device and the Box 3 as a wake-word-listening device on my desk. A desk seems like the best place for them, given the mouth-to-mic proximity and relatively low noise level. I didn’t have many more issues getting the Box 3 to catch “Hey Jarvis” than I’d experienced with Google or Alexa speakers, at least when sitting at the desk. But trying to activate it across the room lowers the pick-up rate considerably, as does having music playing.
Schoutsen agrees. He has set up similar devices around his home and tried to get his family to use them; it “works about 50 percent [of the time], which is just enough to be frustrating,” he said. He has two children, both younger than 7, and it’s a lost cause trying to get them to speak clearly while close to the microphone. The big tech firms have the edge in noise filtering, both in terms of their hugely built-up language models and the multi-microphone devices they can sell at a loss. (They also tend to sell devices that double as music speakers, but that’s a separate issue.)
There can be a hitch between the Box 3 catching your words and responding. That’s the nature of a low-power device beaming a constant audio stream to a Raspberry Pi and the Pi sorting out that stream’s noise for a clear word signal, then processing your intent. It’s faster if you use Nabu Casa’s cloud connection or if you use Home Assistant on a more powerful PC instead of a Raspberry Pi. And it’s faster still with local wake word processing on the Box 3, a feature that is coming soon (or that’s already available, if you’re familiar with ESPHome flashing).
Home Assistant’s docs cite a roughly eight-second response time for commands you want to process locally on a Raspberry Pi. I’ve had my Pi 4 respond maybe a second faster or slower than that when using the all-local pipeline. With an early version of local wake word detection installed on my Box 3, and a cloud-assisted pipeline, the time from when I’m done speaking to the command being processed is usually three to five seconds, unless it needs another second to figure out what I said.
Testing the same types of commands on a Google Assistant speaker, I get vocal response times of between three and seven seconds. The success rate of the assistant interpreting what I’m saying, when said in slightly different ways, is notably higher. But Google Assistant is far more chatty when it can’t complete a command. It also frustrates me by only intermittently understanding which speaker I want things played on, and it sometimes randomly decides to tell me about some unrelated feature I didn’t ask about.
For different people, those are going to be different pluses and minuses. If you’re regularly asking your voice assistant about the weather, about the hours of local businesses, or to add things to your calendar, Home Assistant’s voice control is not for you—not yet, at least. But if you just want your home to respond to your demands and leave the rest to your always-present phone, there’s a real voice alternative, even if it’s distinctly in a beta phase.
What’s next for voice control (including a local LLM)
Running voice control on a newer Raspberry Pi 5 (for which Home Assistant is not quite ready) might speed things up a bit, but Schoutsen said not to expect big changes. Whisper, the speech-to-text engine, is still a heavy load, and it’s much faster running on Nabu Casa’s cloud. If you want voice to be a big, responsive part of your local-control smart home, it’s a good idea to get a Nabu Casa account or look beyond Pi devices for your Home Assistant host.
All that said, when I tell it the right things to do, at the right range, the Box 3 works just fine. I’ve used it to set the thermostat, turn lights on and off, and activate automations I’ve set up. That’s helpful for me, at least when a simple thought can be said faster than I can load up the web dashboard or grab my phone.
Beyond devices, rooms, scenes, and automations, there’s not much more to Home Assistant’s voice, at least right now (or without bolting a bunch of add-on systems to your setup). You can ask it about the weather because Home Assistant has a default weather integration. If you tack on a to-do list and a shopping list, it can work with those, too. But any conversational prompt that’s not related to an entity you’ve named in your home won’t work.
Even something as simple as setting a timer becomes tricky when working in the parlance of Home Assistant, Schoutsen noted. Is the timer a new entity that gets deleted when it runs out? Is it a scene you’re setting with a parameter? Home Assistant is built to do very specific things in certain container-minded ways, whereas a modern voice assistant wants to be a general-purpose problem-solver backed up by giga-scale cloud services.
Still, Home Assistant is working on timers and some other voice expansions, Schoutsen said. There are some interesting possibilities in letting ChatGPT in to figure out what you’re trying to tell your smart home. It runs slowly, but with work, Schoutsen expects to integrate some kind of LLM into Home Assistant as a voice control option, with a local LLM option available, of course.
Like many hobbyist projects I embark on, I thought the big challenge of setting up voice control with Home Assistant would be the gear: hardware, software, and networking. Really, though, it was figuring out what I truly wanted as a result. Home Assistant and its eager users have built up a small crew of devices that can listen. Now, we have to determine how much we really want to say.