this post was submitted on 22 Apr 2024
There is a lot to unpack in your post and this will be very long, sorry about that:
First off, what you are requesting is called "Automatic Speech Recognition", or ASR for short, and the fundamental idea behind it is to receive a speech signal and convert it into a workable format. Usually this workable format means text or prompted tasks. Whether this is AI or not depends largely on how broadly you define AI. I wouldn't classify it as AI because, at its core, it's just statistical analysis. But AI can help fix errors, more on that later.
ASR traditionally works on a Hidden Markov Model (HMM), a statistical model in which each state depends only on the state attained in the previous event, so it's recognizing previously observed patterns. These patterns are taught to the model by a training process.
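To make the "only depends on the previous state" part concrete, here is a minimal sketch of Viterbi decoding on a toy HMM. The two states (`sil`, `speech`), the two observation symbols (`quiet`, `loud`) and all probabilities are made up for illustration; a real recognizer would use phoneme states and trained parameters.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Markov property: only the previous column is needed
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most likely final state
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical toy parameters (not trained on anything)
STATES = ("sil", "speech")
START_P = {"sil": 0.8, "speech": 0.2}
TRANS_P = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
}
EMIT_P = {
    "sil": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.2, "loud": 0.8},
}

decoded = viterbi(["quiet", "loud", "loud", "quiet"], STATES, START_P, TRANS_P, EMIT_P)
print(decoded)
```

The "training process" mentioned above is what would produce the numbers in `START_P`, `TRANS_P` and `EMIT_P`; here they are simply hard-coded.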
The generalized process works like this:
- Cut the audio signal into small frames and analyze them according to a set of features like tonality, voicedness and formants. This process is called feature extraction. Create data vectors that contain information about the features of the raw signal.
- Load these features into a decoder. The decoder is an acoustic model that looks up the phonemes it recognized through the features in a dictionary and computes the most likely word in its dictionary. These results are retained and sequences of words are compared to the decoder's language model. What it recognizes and how well it recognizes signals is based on its own dictionary and the language model used afterwards.
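The framing and feature extraction step could be sketched roughly like this. Everything here is an assumption for illustration: the frame length, the hop size, the synthetic sine wave standing in for real audio, and the two features (energy and zero-crossing rate), which are much cruder than the tonality, voicedness and formant features a real system would compute.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of equal length."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


def extract_features(frame):
    """Return a tiny feature vector: (mean energy, zero-crossing rate)."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return (energy, zcr)


# Stand-in for recorded audio: 0.1 s of a 100 Hz sine sampled at 8 kHz
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE) for n in range(800)]

# 25 ms frames (200 samples) with a 10 ms hop (80 samples)
frames = frame_signal(signal, frame_len=200, hop=80)
features = [extract_features(f) for f in frames]
print(len(frames), features[0])
```

Each tuple in `features` plays the role of one data vector that would be handed to the decoder.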
Language models are essentially just presets that dictate what is accepted as a valid signal input. For an activation phrase, this would be a very simple grammar-based model that recognizes only the exact predefined token for the activation and rejects everything else. For general use, you can write a more adaptive grammar, or many different grammars at the same time, but you will still run into cases where the model rejects an input because it cannot find a grammar that matches the signal. This is called out-of-grammar (OOG) speech.
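As a rough sketch of the grammar idea, the snippet below uses regular expressions as stand-ins for grammars and works on already-transcribed text, which is a simplification (a real decoder constrains the acoustic search itself). The grammar names and phrases (`hey assistant`, `set a timer for ...`) are hypothetical.

```python
import re

# Hypothetical grammars: one exact activation phrase, one slightly more adaptive command
GRAMMARS = {
    "activate": re.compile(r"hey assistant"),
    "set_timer": re.compile(r"set (?:a )?timer for (?P<minutes>\d+) minutes?"),
}


def match_grammar(utterance):
    """Return (grammar name, captured slots), or None for out-of-grammar input."""
    text = utterance.lower().strip()
    for name, pattern in GRAMMARS.items():
        match = pattern.fullmatch(text)  # the whole input must fit the grammar
        if match:
            return name, match.groupdict()
    return None  # OOG: no grammar accepts this input


print(match_grammar("Hey assistant"))
print(match_grammar("set a timer for 5 minutes"))
print(match_grammar("what's the weather like"))
```

The last call returns `None`, which is the OOG case: the input may be perfectly good speech, but no grammar accepts it.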
To reduce OOG errors, you can train a statistical language model (SLM), which is basically a model built from a huge library of natural language data so it doesn't rely on fixed grammars. A large language model (LLM) is like a very advanced SLM with a ridiculous amount of training data and trained, contextual connections between subjects. It's called large because it requires an insane amount of data to function on even a very basic level. You can easily mix grammar-based and SLM approaches, so that you only need to use the SLM when an input is not recognized.
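A minimal sketch of mixing the two approaches, assuming the decoder hands over a list of candidate transcriptions: fixed commands are checked first, and only when none matches does a tiny bigram model (a very small stand-in for an SLM) pick the most plausible candidate. The corpus, the command set and the function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical training corpus; a real SLM would use vastly more data
CORPUS = [
    "turn on the light",
    "turn off the light",
    "turn on the radio",
    "set a timer",
]
FIXED_COMMANDS = {"turn on the light", "turn off the light"}


def train_bigrams(sentences):
    """Count unigrams and bigrams, with sentence start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the bigram model (add-one smoothing)."""
    unigrams, bigrams, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )


def recognize(hypotheses, model):
    """Try the fixed grammar first; fall back to the SLM only if nothing matches."""
    for hypothesis in hypotheses:
        if hypothesis in FIXED_COMMANDS:
            return hypothesis, "grammar"
    return max(hypotheses, key=lambda h: log_prob(h, model)), "slm"


model = train_bigrams(CORPUS)
print(recognize(["turn on the lied", "turn on the light"], model))
print(recognize(["turn on the ray dio", "turn on the radio"], model))
```

The first call is resolved by the grammar alone; the second has no grammar match, so the bigram scores decide between the candidates.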
Source: Writing programs that recognize speech inputs and do tasks based upon them, like what your doctor probably has, was my last job until I quit. Whether we used a grammar-based approach or an SLM approach was entirely up to the specific use case. Purely grammar-based is more privacy-friendly because the computational work required is easily managed by most smartphones or other small portable devices and can easily be done offline. SLM solutions were generally not portable to handheld devices without relying on a cloud service doing the recognition (or at least not if you wanted an acceptable speed of input processing).
tl;dr If you want just plain speech-to-text where the program just writes down what it thinks you said and does not do any error correction, then you can do that offline (the language model my workplace used was from Dragon). If you want your assistant to "understand" what you were trying to say, you will require AI of some form, and that is generally not very privacy-friendly.