
Testing 6 Image Description Services on Android: Observations and Thoughts

On November 7th, 2023, I published an article addressing the state of image descriptions. At the time, Android lacked AI-based services that let blind individuals share or capture photos and receive descriptions of them. Now, less than five months later, I’m excited to present my findings from testing six image description services on Android. Spoiler alert: four of them yielded impressive results!

Before going into details and observations, I want to emphasize my appreciation for everyone working to provide image descriptions to blind people. This article isn’t about declaring winners or losers; rather, it aims to evaluate the available options, assess the current reliability of image description, and point out potential caveats.

Tested Services and Procedure:

Tested Services:

  • Seeing AI
  • Be My AI
  • Sullivan Finder
  • Tech-Freedom
  • Google Lookout
  • Envision AI

Procedure:

To ensure equal conditions, nine images were carefully selected and shared with each service. The received descriptions were saved and analyzed. All images were captured and verified by sighted people. Previous experiences with the tested services were not taken into consideration.

In selecting the images, I aimed to assess how the services handle various scenarios. The nine images include: a completely dark image, an aerial night view of a city taken from an airplane, a street view with shops and cars, a detailed living room with a hand deliberately appearing, two plates of food on a table, a child driving a four-wheeled cart, two people playing billiards, two snack packages, and two famous political figures. These images were chosen to evaluate how accurately the models describe details, people, and text in photos, and how they handle the identification of famous figures, particularly political ones, considering the limitations that might currently be imposed on AI models to fight election misinformation.

Four of the tested services (Tech-Freedom, Seeing AI, Be My AI, and Sullivan Finder) offer the ability to ask additional questions about the image. This feature was not used in these tests; it is nonetheless an important addition that may be discussed in a later article.

The Envision app also has an “Ask Envision” feature, but it currently relies on the text of the generated description rather than the image itself. This distinction makes it not directly comparable to the other offerings.

Also, it is worth noting that Tech-Freedom and Seeing AI offer the ability to save images and share them with others.

Additionally, although all used apps offer a variety of other services, this article solely focuses on their image recognition and description capabilities.

The inferior services:

Unfortunately, Google Lookout and Envision AI proved unable to compete effectively.

Despite Google’s development of the Gemini AI model, it has not employed this technology in its app for blind users. Throughout the testing, Lookout exhibited weak language, struggled to structure meaningful descriptions, and often lacked necessary details. While the other tested services are online, Lookout downloads offline data for image description but switches to an online model when internet connectivity is available. By comparing the descriptions Lookout provided with and without an internet connection, I could easily see a difference in how it handles descriptions, so I ran all the tests with an active internet connection.
When describing the provided images, Lookout often relied on phrases like “we can see,” and some sentences were unclear. In many instances, it failed to provide adequate details. For example, in describing a living room photo, it simply stated, “In this picture we can see the person holding the hand. We can see a hand on the carpet.” Similarly, in a street photo, it concluded with, “It looks like a place and there is the sky in the background.” For the photo of a boy riding a pedal go-kart, it stated: “In this picture we can see the child holding the handlebars of a toy vehicle and riding it on the wooden cart lane. There is a person sitting on the left side and also there is another person sitting on the right side on the grey color railing and they are holding the orange bicycle and they are riding it.” To be fair to the service, it did capture a trash can in the same photo, an item that no other tested service mentioned.

Envision also struggled to compete, providing only minimal details in its descriptions, often limited to just one or two sentences. While such descriptions might have been considered impressive two years ago, the more detailed AI descriptions now available highlight their weaknesses.

These two apps will be mentioned below only when necessary.

General Observations:

Impressive Overall Results:

The first prominent observation is that the results obtained from the four AI-based services were pleasing. Across the nine images, all of them got the basics right. The descriptions were meaningful and provided plenty of details, although some details were, of course, incorrect. Hallucination was kept to a minimum, which pleasantly surprised me.

In the dark image, Sullivan Finder clearly stated, “The image is predominantly dark, making it difficult to discern any specific details. It appears to be completely black, suggesting it may have been taken in a low-light setting or the lens was covered.” Be My AI similarly described the darkness accurately, noting a “very faint, slightly lighter black circle near the center, but it’s barely visible.” There was a slight hallucination about the presence of a circle. Seeing AI briefly mentioned a circle in its short description but did not include it in the full detailed description. Tech-Freedom also correctly identified the absence of visible objects, describing the image as “solid black throughout.” Even Lookout accurately stated, “This picture is taken in a dark room.” Only Envision missed the image completely, describing it as: “Looks like an airplane flying in the dark sky with the moon in the background.”

The image of a city taken from an airplane exemplified the impressive recognition capabilities of the AI services. Seeing AI accurately described the scene: “The image captures an aerial night view of a densely populated area with numerous lights. A main thoroughfare, brightly illuminated, runs vertically through the center of the image, with smaller streets branching off horizontally. The thoroughfare appears to be a significant road, possibly a highway, due to its straight path and the concentration of lights along it. The surrounding areas are speckled with clusters of lights, indicating buildings and other structures. The density of lights decreases towards the edges of the image, suggesting the central area is more heavily developed or populated.”

Be My AI also provided a detailed description: “The picture shows a nighttime aerial view of a cityscape. The city is illuminated by countless lights, giving it a vibrant and lively appearance. The lights are in various colors, predominantly white, with some areas glowing in yellow or red hues. There are dense clusters of lights indicating a highly populated area with many buildings. A main road can be seen snaking through the city with bright white lights, suggesting heavy traffic. The surrounding darkness hints at the vastness of the city extending beyond the frame. The view is slightly blurred, possibly due to the movement during the capture of the image.”

Similarly, Sullivan Finder’s description was accurate: “This image depicts a night-time aerial view of a sprawling city. The area is densely packed with numerous lights from buildings, streets, and other structures, creating a luminous tapestry against the dark surroundings. A major thoroughfare visibly cuts through the middle of the frame, distinguished by brighter, more concentrated lighting, and appearing as a bustling main artery amidst the urban landscape. The perspective suggests that the photo was taken from an elevated position, likely from an aircraft given the angle and the presence of what appears to be part of an airplane’s interior in the top right corner.”

The fingers deliberately placed in the living room photo were easily detected by Seeing AI, Sullivan Finder, and Be My AI, all suggesting that they may belong to the person taking the picture. However, Tech-Freedom missed this detail.

The kid’s activity in the riding image was correctly identified, as were the food on the table, the main street view, and so on. The four services were generally close, with a few differences that will be highlighted throughout the article.

Additionally, the weather and lighting conditions were consistently recognized throughout the tests.

Confusion over details:

Giving as many details as possible can be very useful in describing images. However, providing incorrect details can be genuinely problematic. In the case of the living room photo, which contained furniture, items on tables, and a turned-on TV, the AI services displayed inconsistencies in their descriptions.

The living room contained three tables and two big sofas, yet none of the services identified all of them: the descriptions always counted two tables and only one big sofa (in fairness, one side of a sofa is not clearly visible, which may excuse the miscount).

Seeing AI described the other details mostly correctly but mistakenly identified a teal-colored ottoman, likely confusing it with the wooden part of the sofa and its legs.

Be My AI stated that there are two matching armchairs with a purple pattern and a small wooden side table between them. However, in reality, the table is not positioned between the chairs.

Moving to the image of the two snack bags, Be My AI mixed up the backgrounds and the text, attributing the text written on the red bag to the orange bag.

Misidentification and Fake Elements:

The lighting conditions and how things appear in a photo play a significant role in identifying them correctly. The issue with the AI services is that they attempt to identify details even when those details are not clearly visible, often using a confident tone. They also tend to assume certain elements are present simply because related elements are.

When the services identified the meal as shredded chicken with rice, most of them added other items that were not clearly visible in the container. Beans were included in the descriptions by all four AI services, with Tech-Freedom and Be My AI also adding vegetables, although Be My AI used the word “possibly,” indicating uncertainty. The nuts were likely mislabeled as beans.

Also in the meal image, the spoon was wrongly identified as a fork by all the AI services except Tech-Freedom.

Returning to the living room photo, the heater in the room was misidentified by all four services: three of them called it a speaker, while Sullivan Finder referred to it as a large wooden chest with visible round ornate carvings on top. In the same photo, Sullivan, Be My AI, and Seeing AI described what was displayed on the TV as an animated show or characters, when in reality it was a show with real people. Tech-Freedom ignored the state of the TV completely, only mentioning its presence in the room. Near the TV, both Seeing AI and Sullivan Finder mentioned a DVD player, with Sullivan being less certain and using the word “possibly.” Be My AI described the TV stand as dark wood with a router on top and various compartments, avoiding any incorrect identification, whereas Tech-Freedom did not mention the TV stand at all, simply stating that a television set was mounted on the wall. According to a sighted viewer, there is no DVD player near the TV or anywhere in the room, but the receiver is positioned in a way that makes it resemble one, and there is also a power bank under the router, so the misidentification may be somewhat justified here.

In the photo of two men playing billiards, Be My AI mentioned a cup on a chair, although no cup is visible anywhere in the photo. Additionally, both Be My AI and Seeing AI incorrectly identified the men’s reflections as another person, with Be My AI suggesting a spectator and Seeing AI the person taking the photo.

When presented with the image of the kid, only Sullivan Finder explicitly mentioned a four-wheeled pedal go-kart, while Seeing AI and Be My AI stated only “pedal go-kart.” Tech-Freedom incorrectly described it as a tricycle, as did Envision.

Another misidentification occurred in the street view photo when Be My AI mentioned, “A man is sitting on a raised flower bed filled with lush plants and flowers in front of the flower shop.” However, the man was actually standing, not sitting, and the notion of a raised flower bed seems inaccurate. Seeing AI also misinterpreted the man’s posture as sitting but did not mention a flower bed. Only Sullivan Finder correctly identified the man’s standing posture.

Tech-Freedom fabricated greenery in the cityscape image taken from the airplane, even though there was no visible evidence of green areas. Its description stated: “The roads are lined with trees, their silhouettes adding a touch of nature to the urban landscape. The city is not just a concrete jungle; there are patches of green interspersed throughout the urban sprawl.”

In the presidents’ photo, there were a pen and paper on the table; interestingly, all the services missed that detail completely. The four AI services mentioned only the flowers that were also on the table.

People recognition:

Four of the provided images featured individuals. For the most part, the people were identified, except in the street scene, where Tech-Freedom stated that no people were visible. I noticed that the services refrained from describing people’s appearance, such as facial details or skin color, likely to avoid biased identification or misinterpretation.

In the billiard game, there was no indication of the two players’ approximate ages, despite one of them being elderly. The only service that noted anything about their appearance was Be My AI, stating that one person is bald and the other has gray hair.

In the image of the kid driving the go-kart, only Seeing AI estimated the kid’s age, stating in the short description that appears before the more detailed information that he was nine years old. Other kids appearing in the background went unrecognized in the descriptions. Only Be My AI identified the pink go-kart close to the main subject together with the child riding it; Seeing AI mentioned the pink go-kart but not a rider, and Sullivan Finder only indicated the presence of other go-karts in the view.

The services seemed more comfortable with describing clothing colors and types, such as a blue shirt, a blue cap, a suit, etc.

Regarding the famous political figures, the only service that identified them was Be My AI, stating that they are Vladimir Putin, the President of Russia, and Donald Trump, the former President of the United States. Sullivan Finder recognized them as two men and identified the flags of their respective countries behind them, indicating a diplomatic or political meeting. However, it incorrectly stated that both men were sitting opposite each other with their hands placed on their knees, whereas in reality only one of them had his hands in that position, and they were sitting beside each other.

Tech-Freedom also detected the flags and mentioned the diplomatic meeting, but couldn’t identify which countries they belonged to. Lookout couldn’t identify the flags, while Envision did, stating that there are two men in suits. The real surprise, however, was how Seeing AI treated this image, which will be discussed below in the app-specific observations.

Text Recognition:

I didn’t include a photo of a paper or a screenshot of text among the images because this type of document is usually easy to recognize. The text in the test set appeared mainly on the snack bags and in the names of the shops in the street view. The clearly printed snack brand was identified by all four AI services. However, most of them struggled with the other text. Be My AI mixed things up badly, to the point of recognizing “potato” written on one of the bags and then placing an image of potato crisps on that bag. Tech-Freedom didn’t mention the “spicy” wording, whereas Sullivan Finder detected “spicy” in its correct position but treated “mixed nuts” as the flavor of the other bag, which didn’t actually carry any flavor wording; it also reversed the bags’ positions, putting the hot and spicy one on the left and the other on the right. Seeing AI did best in this regard, stating the “hot and spicy” text correctly as well as the “snack mix” wording.
In the street view, things were more challenging, especially with the presence of Arabic text. None of the services said a word about it. And when reading the English shop signs, none of the services performed well, with mistakes ranging from wrong text identification to an inability to tell that two pieces of text belonged to the same shop.

Finally, it is worth mentioning that Lookout, Envision, and Seeing AI list the recognized text in images separately, with Envision being the only one that can identify Arabic characters. However, in images like the street photo, the recognized text is just scattered and of little practical use; for example, there is no way to tell which shop a phone number belongs to.

App-specific observations:

Seeing AI:

In most photos, Seeing AI performed very well, providing details in images that no other service mentioned. It was the only service that identified car types in the street: “red hatchback car and two black SUVs. On the left side of the image, there is a white van with its rear doors open, parked next to a black tent-like structure with the letters ‘SC’ on it, which could be a temporary setup or a small business outlet.” However, its suggestion about the van’s purpose was inaccurate: the van was parked there to load or unload items. In the same photo, the app also identified the presence of trees and power lines running above the street.

A cool feature included in Seeing AI is the recognition of faces. The feature allows users to teach the app to recognize people by taking photos of their faces and giving them names. When a face is detected in the image, the person will be identified by the provided name. This worked well in the kid’s photo.

Despite the impressive overall results, two things badly impacted the experience. The first is how often the app fails to produce a description. With many photos, I needed to tap “try again,” sometimes more than five times, before the app delivered the details. This occurred at times when every other service was working without issues, and the waiting time was less than a minute.

The other surprising issue was how Seeing AI handled the political figures photo. I shared the photo from the file manager, as I did with all the other photos, only to find myself returned to the file manager with nothing displayed. After trying again and again, I finally received a message that Seeing AI had stopped, with “wait” and “close program” buttons.

At that moment I didn’t connect this to the type of photo, so I tried again, with exactly the same results. Other photos caused no issues. I repeated the test several times to make sure it wasn’t just a one-off, but I always got the same result.

It is understandable that limitations might be imposed on the underlying AI service, but crashing outright was unexpected. Such an issue needs attention from the relevant team at Microsoft.

Be My AI:

What made Be My AI special was that it had the fewest restrictions among the services. It was the only one that identified the presidents correctly, with just a mistake in the clothing color. It also recognized the men in the billiard game as a bald man and another with gray hair, identified another kid riding a pink pedal go-kart, and recognized the presence of unfinished buildings in the street view. Despite some mistakes, Be My AI delivered good recognition overall. Given the two problems mentioned with Seeing AI, Be My AI could be handed the crown among the tested services.

Sullivan Finder:

I have no clue which AI model Sullivan Finder uses, but it performed fairly well during the tests. It may be slightly less detailed than Seeing AI and Be My AI, but it delivered good overall results. Sullivan was the only service that identified the plants on the balcony appearing in the living room photo, saying: “Decorative plants are also visible in the background beside the windows.” Another noteworthy point about Sullivan Finder is that it suggests related questions after each photo’s description. While not an essential addition, some people might find it useful.

Tech-Freedom:

Although Tech-Freedom provided details, its descriptions were slightly weaker than those of the other three services above. The biggest drawback with Tech-Freedom, however, is its over-positivity and unnecessary commentary. I recall encountering similar issues during my tests of Bing Chat, discussed in the article linked at the beginning; the AI model used clearly has this problem. While the other services added comments here and there, they were not as prominent as Tech-Freedom’s unnecessary additions. Certainly, adding the right comment or bit of reasoning to a description, when appropriate, could make a big difference for a blind person. However, there is a fine line between that type of commentary and a judgmental, usually positive tone that treats every photo as a masterpiece or delivers exaggerated writing better suited to poetry than to real-life descriptions.

Examples of Tech-Freedom’s tone include: “The child seems to be enjoying their time on the tricycle, possibly riding it around the deck.” Putting aside that it was a four-wheeled vehicle, the suggestion that the kid was enjoying the ride was unnecessary, especially since the kid’s face appears neutral in the photo. Also, regarding the food: “The salad appears fresh and vibrant, suggesting it might be a healthy meal option. A spoon is placed in the container, indicating that the dish is ready to be eaten. The table itself is a light color, providing a neutral backdrop that allows the colors of the food to stand out. The overall setting suggests a casual mealtime, possibly for lunch or dinner.” Additionally, in the aerial city image: “The image captures a breathtaking aerial view of a city at night. The city is densely packed with buildings, their lights twinkling like stars against the night sky. The city is not just a concrete jungle; there are patches of green interspersed throughout the urban sprawl. These green spaces provide a stark contrast to the city’s predominantly white and gray colors.” In the living room description: “The image shows a cozy living room with a warm and inviting atmosphere. The room is furnished with a comfortable black leather sofa. A small white vase adds a touch of elegance to the room. Overall, the living room is a well-designed space that combines comfort and style. The choice of colors and materials creates a harmonious and inviting environment.” And it described the billiard game as “dimly lit, creating an atmosphere of concentration and focus.”

Final thoughts:

Image description services have shown great progress, with AI models advancing in this field by combining recognition capabilities with human-understandable language. Having free image description services in the hands of blind Android users helps them identify the images they encounter or try to capture. Taking advantage of the available services, even the basic ones, is never a bad choice. As this simple nine-image test showed, each service excelled in some images and missed details in others, but overall the detection rate is high, with descriptions closely matching the real images.

Moreover, the ability to ask questions about photos lets users receive more information and engage in a conversation about the images, improving accuracy and completeness. Users should assess image description services fairly, recognizing their strengths and weaknesses. Additionally, since these services are AI-based, they may behave differently when analyzing the same image; for instance, when I showed the snack bags to one of the services again after the test, the new description included the cartoon figure’s gesture on the bag.

Personally, I am very excited to keep monitoring how AI image descriptions evolve: whether current issues persist, whether blind-related apps can keep up with newer, more capable models, whether the models themselves will consider blind users’ needs, how they will interpret complicated images like memes, and whether new complications will arise. Looking forward to Image Description Testing Take 2.

About Author

Kareen Kiwan

Since her introduction to Android in late 2012, Kareen Kiwan has been a fan of the operating system, devoting some of her time to clearing up misconceptions about Android among blind people. She wrote articles about its accessibility and features on the Blindtec.net Arabic website, where she was a team member. Kareen's experience was gained through following Android-related communities, fueled by her love for technology and her desire to test new innovations. She enjoys writing Android-related articles and believes in the role of proper communication with both blind Android screen reader users and app developers in building a more accessible and inclusive Android. Kareen is a member of the Blind Android Users podcast team and the Accessible Android editorial staff.

Published in Articles

2 Comments

  1. Talha Haider

    Excellent article. Your review of the A55 was superbly written, covering each and every important aspect. And this one isn’t any different. I have recently discovered a couple of other apps on Android with similar capabilities; it’s nice to see we have so many different options to choose from without worrying too much about a dip in quality. It’s baffling to think that before December last year, we hardly had any useful resource to turn to for decent image descriptions. Brilliant stuff, keep up the wonderful work!

    • Kareen Kiwan

      Thank you, I highly appreciate your kind words. I enjoy testing AI image description capabilities and how they are developing with time. Sure, having more services, especially reliable ones, translates to a better overall experience. Keep exploring and consider sharing your findings and discoveries 🙂
