
Apple research paper reveals AI that understands visual elements

We'll likely learn more about Apple's AI developments at WWDC on June 10th

Researchers at Apple have reportedly developed a new AI system called ReALM (Reference Resolution As Language Modeling) that can read and understand visual elements, essentially allowing it to decipher what appears on a device's screen.

The research paper suggests that the new model reconstructs the screen using "parsed on-screen entities" and their locations in a textual layout. This essentially captures the visual layout of the on-screen page, and according to the researchers, when a model is fine-tuned specifically for this approach, it can outperform even GPT-4 and lead to more natural, intuitive interactions.
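To illustrate the idea (this is a hypothetical sketch, not Apple's code), converting parsed on-screen entities and their locations into a textual layout might look something like the following; the Entity fields and the row-grouping heuristic are assumptions made for the example.

```python
# Hypothetical sketch: each parsed on-screen entity has text plus a position,
# and the screen is re-rendered as plain text so a language model can reason about it.
from dataclasses import dataclass

@dataclass
class Entity:
    text: str
    x: float  # horizontal position of the element
    y: float  # vertical position of the element

def screen_to_text(entities: list[Entity], row_tolerance: float = 10.0) -> str:
    """Group entities into rows by vertical position, then order each row
    left to right, producing a rough textual reconstruction of the screen."""
    rows: list[list[Entity]] = []
    for e in sorted(entities, key=lambda e: e.y):
        if rows and abs(rows[-1][0].y - e.y) <= row_tolerance:
            rows[-1].append(e)
        else:
            rows.append([e])
    return "\n".join(
        " ".join(e.text for e in sorted(row, key=lambda e: e.x)) for row in rows
    )

# Example: a missed-call notification the assistant could then reason about
screen = [
    Entity("Missed call", 10, 5),
    Entity("Mobile Shop", 10, 25),
    Entity("613-555-0123", 120, 25),
    Entity("Call back", 10, 50),
]
print(screen_to_text(screen))
# Missed call
# Mobile Shop 613-555-0123
# Call back
```

With a textual reconstruction like this, a query such as "call that number back" becomes a reference-resolution problem a fine-tuned language model can handle, which is the scenario the paper describes.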

“Being able to understand context, including references, is essential for a conversational assistant,” reads the research paper. “Enabling the user to issue queries about what they see on their screen is a crucial step in ensuring a true hands-free experience in voice assistants.” The development could one day make its way to Siri, helping the assistant become more conversational and truly hands-free.

While it's unlikely we'll hear more about ReALM specifically this year, we should learn more about Apple's AI-related developments, including features coming to Siri, at WWDC 2024 on June 10th.

Read more about ReALM here.

Image credit: Shutterstock

Source: Apple Via: VentureBeat

