Apple’s developers have set themselves an ambitious goal: to fundamentally change and improve the way voice assistants understand and respond to commands.
Recently, the company’s researchers presented an artificial intelligence system called ReALM (Reference Resolution as Language Modeling) to the general public.
One of the tasks set for the team was to dramatically improve how well the model understands commands given to it in a person’s own words, and to respond to and execute those commands quickly and accurately.
The developers have succeeded in enabling ReALM to decipher vague or ambiguous commands that can only be understood from the context of the situation. As a result, the interaction between the device and the person becomes far more intuitive and natural.
One of the main problems with neural networks and voice assistants has been that they do not understand and interpret implied meanings, idioms, metaphors, and other elements of spoken and written language well enough.
Modern neural networks also struggle with pronouns such as “it”, “they”, or “that”, which people use effortlessly to refer back to earlier parts of a conversation.
For example, imagine a user asking Siri to “find me a healthy recipe based on what’s in my fridge, but without mushrooms – I hate them.”
With ReALM, the device not only understands the reference to information on the screen (the contents of your refrigerator), but also remembers your personal preferences (the dislike of mushrooms) and narrows the recipe search accordingly.
ReALM has already begun to address this problem: the model is able to connect the words a person says with the objects displayed on the smartphone screen.
Apple’s new AI system enables efficient interaction with digital assistants based on what is displayed on the screen at a given moment, without the need for precise, explicit instructions. This makes digital assistants more useful in a variety of situations – for example, for drivers controlling a smartphone by voice, or for users with special needs.
To do this, ReALM reconstructs the screen, analyzing the objects on it and where they are located. This produces a textual representation of the screen that preserves its visual context.
Reference Resolution As Language Modeling first analyzes what is displayed on the user’s screen. It then generates text describing the objects that were just shown, keeping their on-screen arrangement and tagging the parts of the screen that correspond to entities.
ReALM then feeds this representation to a large language model (LLM), which understands the context, the domain-specific vocabulary, and the relationships between the tagged entities – see the sketch below.
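To make the idea concrete, here is a minimal, hypothetical sketch in Swift of what such a flattening step could look like. The `ScreenEntity` type, the tag format, and the example data are assumptions made purely for illustration; they are not Apple’s actual implementation or API.

```swift
// A sketch of flattening on-screen objects into tagged text that a
// language model could read to resolve a reference like
// "call the second number". All names and formats are illustrative.

struct ScreenEntity {
    let id: Int
    let kind: String     // e.g. "phone_number", "address", "button"
    let label: String    // the visible text of the object
    let x: Double        // approximate position, used only for reading order
    let y: Double
}

/// Produces a textual representation of the screen, listing entities
/// roughly top-to-bottom and left-to-right with their type tags.
func textualScreen(_ entities: [ScreenEntity]) -> String {
    entities
        .sorted { ($0.y, $0.x) < ($1.y, $1.x) }
        .map { "[\($0.id)|\($0.kind)] \($0.label)" }
        .joined(separator: "\n")
}

// Example: a business page showing a name and two phone numbers.
let screen = [
    ScreenEntity(id: 1, kind: "business_name", label: "Main Street Pharmacy", x: 16, y: 40),
    ScreenEntity(id: 2, kind: "phone_number",  label: "+1 555 0100",          x: 16, y: 80),
    ScreenEntity(id: 3, kind: "phone_number",  label: "+1 555 0199",          x: 16, y: 110),
]

// The tagged text plus the spoken request form the model's input.
let prompt = """
Screen:
\(textualScreen(screen))

User request: "call the second number"
Which entity id is being referred to?
"""
print(prompt)
```

Given text like this alongside the spoken request, a language model can resolve “the second number” to entity 3 even though the user never named it explicitly.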
Advantages of ReALM:
- is an ideal choice for a practical reference-resolution system;
- is much lighter and simpler to run than GPT-4 while performing at nearly the same level;
- has far fewer parameters yet still outperforms GPT-3.5;
- outperforms the MARRS model on every type of dataset.
The ReALM-250M model shows impressive results:
- conversational (spoken) understanding – 98.7%;
- synthetic task understanding – 99.8%;
- on-screen task accuracy – 90.6%;
- handling of unseen domains – 97.2%.
Impressive as ReALM’s capabilities are, its biggest advantage lies in Apple’s commitment to on-device artificial intelligence, which offers a high level of privacy and protection of personal data.
As a result, ReALM is designed to run directly on your iPhone and other Apple devices.
By analyzing your device’s data – conversations, app usage patterns, and even environmental sensors – ReALM has the potential to become a hyper-personalized digital assistant tailored to your unique needs.