Chela Robles and her family celebrated her 38th birthday at One House, a bakery in Benicia, California. On the car ride home, Robles used her Google Glass to ask for a description of the world outside. Blind in both eyes, Robles misses the subtle details that help people connect, such as facial expressions and other visual cues. Seeking that kind of assistance, she enrolled in a trial of Ask Envision, an AI assistant powered by OpenAI’s GPT-4, a multimodal model that processes images and text to produce conversational responses. Integrating language models into assistive technologies like this gives visually impaired users a fuller picture of their surroundings and greater independence.
Envision launched as a smartphone app in 2018, letting users read text in photos, and added Google Glass integration in 2021. The company recently adopted an open-source conversational model that can answer basic questions and incorporated OpenAI’s GPT-4 for image-to-text descriptions. Other services, such as Be My Eyes and Microsoft’s Seeing AI, have also integrated GPT-4 to help visually impaired users identify objects and access visual information.
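To make the image-to-text step concrete, here is a minimal sketch of how an assistive app might send a photo to a multimodal model and read back a description, using the OpenAI Python SDK. The prompt, model name, and file path are illustrative assumptions, not a description of Envision’s actual pipeline.

```python
import base64
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(path: str) -> str:
    """Send a local photo to a vision-capable model and return a short description."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Briefly describe this scene for a blind listener."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(describe_image("menu.jpg"))  # hypothetical photo of a restaurant menu
```

A real assistant would add speech output and error handling; the point here is only the shape of the request and the conversational text that comes back.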
The upgrade lets Ask Envision summarize text and answer follow-up questions: it can now read a menu, for example, then report prices, dietary restrictions, and dessert options on request. Richard Beardsley, an early tester, appreciates the hands-free design of Google Glass because it lets him use the service while holding his guide dog’s leash and a cane. Integrating AI into visual aids could profoundly affect users by giving them far more information. Sina Bahram, a blind computer scientist and accessibility consultant, says GPT-4 marks a significant advance, giving users details they could not access just a year ago.
While the blind community’s adoption of cutting-edge technology is exciting, Danna Gurari, an assistant professor of computer science, is concerned about potential drawbacks. Gurari organizes the VizWiz workshop to bring together companies like Envision, AI researchers, and blind technology users. In early testing, she found that image-to-text models can produce erroneous information, or “hallucinate.” Misdescribing the contents of a sandwich is a minor annoyance; misidentifying a medication could have serious consequences. Flawed large language models also risk mislabeling attributes such as age, race, and gender, amplifying biases present in their training data.
Bahram acknowledges these risks and suggests surfacing confidence scores so users can make informed decisions about how far to trust the AI’s interpretation. He argues it would be unfair to withhold visual information from blind people that sighted people take for granted. While the technology cannot replace the fundamental mobility skills independence requires, beta testers are impressed with Ask Envision so far. The system still has limits, though: Robles wants it to read sheet music and to give richer spatial context, such as where objects and people are in a room and which way they are facing. Even so, she believes any additional description the AI provides, errors and all, is valuable.
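As a rough illustration of Bahram’s suggestion, the sketch below shows one way an assistant could attach a confidence score to each description and warn the listener when the model is unsure. The data structure, threshold, and wording are assumptions made for illustration, not a description of how Envision or any other product works.

```python
from dataclasses import dataclass

@dataclass
class Description:
    text: str          # what the model says it sees
    confidence: float  # estimated certainty, 0.0 to 1.0

def announce(desc: Description, warn_below: float = 0.6) -> str:
    """Format a description for speech output, flagging low-confidence results
    so the listener can decide how much weight to give them."""
    if desc.confidence < warn_below:
        return f"I'm not sure, but it looks like {desc.text}."
    return f"{desc.text} (about {desc.confidence:.0%} confident)."

# Hypothetical examples: a high-confidence menu reading and an uncertain label.
print(announce(Description("a dessert menu listing brownies and pie", 0.92)))
print(announce(Description("a pill bottle labeled ibuprofen", 0.41)))
```

The design choice is simply to pass uncertainty through to the user rather than hide it, which is the substance of Bahram’s proposal.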