Last updated on 29 November 2024
On Android, there is no shortage of apps offering AI image description services. However, having image descriptions integrated directly into TalkBack, the built-in screen reader for Android, is an exciting development that could be a game changer. While Jieshuo, the popular third-party screen reader, offers descriptions for both the focused item and the entire screen, this feature is paid and often takes time to deliver results. After testing the online Gemini-based focus description included in Google’s TalkBack v15.0, I’m sharing my thoughts on the opportunities this feature brings, its limitations, and a quick comparison with Jieshuo’s offering.
What is the Detailed Image Description Feature Introduced in TalkBack 15.0?
In TalkBack 15.0, Google has added online image descriptions powered by Gemini 1.5 Flash. The feature uploads a screenshot of the focused item to Gemini and returns a detailed description, and it works with images, buttons, and other onscreen items. Although Google shipped a basic offline image description model starting in version 14.1, that model often failed to provide meaningful descriptions.
The addition of Gemini-based online descriptions marks a significant step forward compared to previous capabilities. Google does have an offline image description feature based on Gemini Nano, but it currently only works on Pixel 9 series devices and has notable issues with the quality of descriptions. For this reason, it won’t be discussed in this article, which will focus solely on the online feature.
To use the new online descriptions, select the “Describe image” option from the TalkBack menu or assign a gesture to it. When enabling the feature for the first time, you’ll be prompted to agree to send images to the online service for processing, with the assurance that they will be deleted after the descriptions are generated. Once you accept, you can use the service as described.
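TalkBack's implementation isn't public, but conceptually the round trip amounts to a single multimodal request: a screenshot plus an instruction, answered with descriptive text. The sketch below approximates this using Google's public google-generativeai Python SDK; the prompt wording, file name, and API key handling are my own assumptions for illustration, not TalkBack's actual code.

```python
# Illustrative sketch only: TalkBack's internals are not public.
# Requires: pip install google-generativeai pillow
import os

import google.generativeai as genai
from PIL import Image

# Assumption: the API key comes from an environment variable.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash")

# Stand-in for the screenshot TalkBack captures of the focused item
# (hypothetical file name).
screenshot = Image.open("focused_item.png")

# The prompt wording is an assumption; TalkBack's real prompt is unknown.
response = model.generate_content(
    ["Describe this image in detail for a blind user.", screenshot]
)
print(response.text)
```

The production feature additionally handles user consent, screenshot capture through the accessibility APIs, and deletion of the uploaded image, none of which this sketch covers.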
Key Benefits of the Image Description Functionality
Swift, Detailed Descriptions
Throughout my testing of TalkBack’s detailed image description feature, I was consistently amazed by how quickly responses were generated. The waiting time rarely exceeded a few seconds, and the failure rate was close to zero. I can confidently say that the speed of TalkBack’s image descriptions is unmatched by any other image description service available on Android. While Lookout, Google’s own app, is fast, it doesn’t provide detailed descriptions.
Gemini 1.5 Flash shows great potential in the field of image descriptions. During my tests, it impressed me most of the time, and even when it didn’t deliver the best results, it still provided a usable description. It successfully identified politicians and a famous football player, described people’s faces without referring to skin color or race, detailed the shapes of buttons and controls, recognized objects within images, and described backgrounds.
Overall, the accuracy of the descriptions was impressive. The service even noted when a hand was present in the image or when the image wasn’t focused on anything in particular.
Identification and Description of Onscreen Controls
While the offline image description provided in Google TalkBack before version 15.0 was more of a joke than a useful service, TalkBack has previously shown good results in identifying and labeling some buttons and controls. However, with the new online service, button descriptions have been elevated to another level. The rich descriptions of onscreen focused controls can help identify elements based on the shape described, even though, in my testing, known button shapes were not always translated into specific names. Instead, the descriptions focused on the shape without indicating what the button or control might be.
Despite this, the hint provided by a button’s shape can still offer some insight into its function. Additionally, receiving a description of a control, even when it is already labeled and identified, can enhance interaction with sighted users. In many cases, the label that a screen reader reads differs from what a sighted person sees. In such scenarios, blind users who know the exact shape of an onscreen element can guide sighted users on how to interact with certain app features. This can also be beneficial when following online guides, which often reference button shapes rather than screen reader labels, even in documentation written by the app developers themselves.
Independent Identification of Images and Scenes
Typically, any item that the screen reader can focus on can be sent for an online image description. This is especially useful when a user is asked to select a photo using the gallery app and there’s no alternative way to pick the required one: the user can rely on the detailed online descriptions to identify each focused photo until the correct one is found. It also proves helpful when browsing the photos app to identify captured images, or to show a sighted friend a specific photo without handing them the phone and letting them navigate through all the user’s photos. While privacy concerns arise from sending screenshots of photos off the device, I am setting those aside here to focus on functionality.
Another potential use of detailed image descriptions is determining whether an object is in focus before capturing a photo. This would require a camera app whose preview can receive accessibility focus, an object in a fixed position, and steady, non-shaky hands.
Additionally, video calls are another area where detailed descriptions could be beneficial. In my tests with WhatsApp video calls, I was able to verify what my camera was showing, although I couldn’t access the other party’s video feed. For a blind person, however, knowing what their camera is showing is often the most important aspect.
Convenient Identification of Items That Can’t Be Easily Described by Other Non-Screen Reader Services
Using AI description services in apps typically requires either capturing a photo within the app or sharing one with the app for analysis. This means that onscreen controls, such as focused buttons, can’t be described by these services. Jieshuo can take a screenshot of the focused item and share it with a third-party service through an extension, but that adds a dependency, and as far as I know, taking a screenshot of the focused item is a paid feature.
TalkBack doesn’t offer a similar feature, but with the introduction of online Gemini image descriptions, this limitation is no longer an issue. The Gemini-based descriptions are usually detailed enough to identify the shape of a particular control without the need for workarounds or sharing items with third-party image description services.
Shortcomings and Areas for Improvement
No Way to Describe Items Outside the Screen Reader’s Accessibility Focus
The detailed image description feature in TalkBack works by capturing the currently focused item and describing it. However, if an item cannot be focused using TalkBack, it cannot be described. For example, in WhatsApp, when viewing a photo, TalkBack immediately focuses on the back button, and no matter how much you touch the screen or swipe, the actual image is never focused. As a result, TalkBack cannot provide a description of the opened image, at least according to my testing.
On the other hand, focusing on a contact’s profile picture allows the photo to be described because the profile picture label is tied to the actual image. In my tests with several profile photos, I was able to get detailed and accurate descriptions. Gemini even recognized an avatar as an illustration of a woman in one case. This example highlights the importance of having an item in focus to describe it properly. Simply touching a point on the screen and issuing the detailed image description command won’t describe the item under your finger unless it can be focused via TalkBack. If it cannot be focused, TalkBack will describe the last focused item, rather than the one you’re trying to interact with.
To make matters worse, TalkBack always moves focus to an element when a new window opens, unlike Jieshuo, which doesn’t force focus. For example, Jieshuo can describe the entire screen if the user doesn’t interact with any item after opening a new window.
Another prominent example is the camera preview in apps like Samsung’s camera. Since the preview isn’t focusable, detecting a scene before capturing a photo remains a theoretical possibility rather than a practical one. However, this issue is resolved in apps like Tech-Freedom, which display the camera preview as a separate, focusable item.
TalkBack currently lacks any enhanced focusing mode that would allow more elements to be brought into focus, creating a limitation for what the image description service can achieve.
Descriptions Don’t Seem Tailored to TalkBack Users
While I’ve mentioned several times that the image descriptions provided by the Gemini 1.5 Flash service integrated into TalkBack are often detailed and accurate, there’s no indication that they are specifically tailored to meet the needs of blind users. For instance, when describing buttons, it would sometimes be useful to provide a hint about what the button might do. In certain images, a blind user might require specific details that go beyond the general description.
Given how a change in the prompt to an AI service can drastically alter the output, and considering Google’s capabilities as a company, adapting these descriptions to better suit blind users’ needs seems achievable. This could be done without compromising the overall detailed descriptions already provided.
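To make that concrete, here is a minimal sketch of the same image sent with two different prompts. Both prompt strings are my own examples rather than anything Google has published; the tailored variant simply encodes the needs discussed above.

```python
# Same image, two prompts: a generic one and one tailored to blind users.
# Both prompt strings are my own examples, not Google's actual prompts.
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")
image = Image.open("focused_button.png")  # hypothetical screenshot

generic = "Describe this image."
tailored = (
    "You are assisting a blind screen reader user. Describe this UI "
    "element's shape and contents, quote any visible text verbatim, and "
    "if it resembles a standard control, suggest what activating it "
    "might do."
)

for prompt in (generic, tailored):
    print(model.generate_content([prompt, image]).text)
```

The tailored prompt trades a little brevity for verbatim text and an actionable hint about the control’s purpose, which is exactly the kind of adaptation the feature currently lacks.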
Text Detection as the Biggest Victim
While TalkBack’s offline text recognition didn’t support a wide range of languages, it worked reasonably well with English. However, with the new detailed descriptions, users no longer have a way to read detected text separately. Automatic text detection can be enabled, but it doesn’t work consistently across all contexts and is less reliable than using a gesture or selecting “Describe Image” to know what text is present in the focused area.
The Gemini online description often summarizes the recognized text, and if the language isn’t supported, it may translate the text into English instead, as happens with Arabic. While a summary might be sufficient in some situations, knowing the exact text is crucial in others, and breaking a feature that previously worked well isn’t ideal. The TalkBack team could retain both capabilities, either through a dedicated command for text recognition or by including the verbatim detected text alongside the image description.
Update: TalkBack 15.1 has resolved this issue by including the detected text as well as any recognized icon in the image description results window. This allows users to read and review this data without any problems.
AI-Based Image Description Limitations
The progress in AI-driven image description has been remarkable. While I consider Gemini 1.5 Flash to be one of the strongest players in this field, it’s important to remain cautious when relying on AI-based recognition. There’s still a risk of missing crucial details, misidentifying elements, or even hallucinating information. For instance, when I asked the service to describe a breaking news notification written in Arabic, it misinterpreted it as a social media comment and completely mistranslated the text, altering the meaning of the news entirely.
Another issue with AI descriptions is the tendency to insert judgments or opinions. In my experience, Gemini’s descriptions are relatively free of this problem compared to some other services, but they aren’t entirely immune. This aspect is still a matter of debate: some users appreciate these kinds of remarks, while others, like me, prefer the AI to avoid unnecessary suggestions or opinions, though in some cases they can be useful.
For example, I don’t need Gemini to tell me the essence of a photo, but it would be helpful to know if an image is very dark or if the subject is out of focus.
Speaking of opinions and suggestions, I’d like to give an example of a useful comment. When I used the “Describe Image” command on a still from a YouTube video, it accurately described the scene. However, it couldn’t tell that the still was part of a TV series, and it appended a comment about the possibility of violence and the importance of seeking help. While this was merely a scene from a series, I found the suggestion acceptable in that context.
Descriptions Presented in a New Screen with a Close Button
When the identification results are displayed, TalkBack now presents them in a new screen with a close button. While this format makes it easier to review the details, it adds an extra step for users to close the description window, either by tapping the close button or using the back gesture.
Having descriptions in a separate screen would be more justifiable if users could ask follow-up questions about the given description. However, since this feature is currently unavailable, it would be more effective to create a record of the most recent description results that users can access, rather than opening a new window each time an item is recognized.
An Even Bigger Gap Between Samsung’s and Google’s TalkBack
I previously published an article discussing the challenges posed by the separation between Samsung and Google TalkBack. With the new Gemini description integration in Google’s TalkBack, this gap has widened even further. The Samsung version still lacks the old offline image descriptions, offering only text and icon detection. Compounding the issue, Samsung ties TalkBack updates to new One UI releases. This means we shouldn’t expect any form of image description until the release of One UI 7.0.
As a result, if I have a non-Samsung phone running Android 11, I can take advantage of Google’s image description service; if I own a Samsung device running Android 14, I cannot. While installing Google’s TalkBack via ADB is still an option, it carries risks that most users shouldn’t be expected to take. Additionally, since Samsung is responsible for maintaining and updating Samsung TalkBack, compatibility issues may arise on Samsung devices without official support from Google.
A Quick Comparison Between Jieshuo and TalkBack Detailed Description Features
Image-based AI descriptions have been available in the Jieshuo screen reader for several months. However, this feature is limited to paid users and allows for only 100 recognitions per day for detailed image descriptions. Additionally, the AI service used by Jieshuo is inferior to what Gemini offers. The details are often less comprehensive, and the waiting time is significantly longer, with instances where the service stops working entirely. Jieshuo’s service also fails to recognize famous people or popular brand logos, whereas TalkBack’s Gemini service can identify such information. For example, while TalkBack was able to recognize the Ferrari logo, Jieshuo could not.
On the other hand, Jieshuo allows users to recognize text for both the focused item and the entire screen, and the same applies to image recognition. Jieshuo’s detailed focus mode should enable the recognition of more items, although I haven’t tested this practically.
A serious concern worth mentioning is that Jieshuo does not provide any information about how images are handled when using the online service, leaving users uncertain about the privacy of uploaded items. Providing such clarification would be more reassuring and demonstrate professionalism.
Final Remarks
Although TalkBack is not my daily screen reader, the addition of Gemini 1.5 Flash online descriptions has fueled my excitement, and it did not disappoint me despite some shortcomings. Having this capability in a built-in, free screen reader opens up many possibilities. If Google is willing to extend beyond the current offerings, it could play a significant role in providing equal access to apps.
In the future, if the AI model becomes aware of the context, such as the app in use, and if more work is done to tailor descriptions to the needs of blind users, it could be a real game changer. Additionally, a separate text detection option is essential, especially since blind individuals often encounter text in images or other inaccessible formats.
Moreover, if Google added the ability to ask follow-up questions after an image is described, it would further enhance the experience, opening up possibilities such as obtaining more information about a specific aspect of the image or asking the service to read the exact detected text within it.
As the first iteration, TalkBack’s detailed image descriptions are impressive. They are a vital addition that elevates TalkBack to a new level. Kudos to Google for this advancement and for not geo-restricting the description feature, ensuring equal access for users regardless of their location. I look forward to seeing how this feature develops and what it enables blind users to achieve.