Images are everywhere. They are found in your personal and group chats on instant messaging apps, they dominate a significant portion of your social media feeds, and they likely reside in your gallery app with no text labels.
Not too long ago, when your screen reader announced “image,” you felt powerless. Your only way to understand the content of a sent image, for example, was to ask the sender for an explanation, which sometimes left you feeling like a burden.
However, over time and with technological advancements, things started to improve. Apps capable of identifying objects within an image began to emerge, and this identification progressed to the point where software could generate short, meaningful descriptions for images.
With the ongoing AI revolution, we’ve reached a new level, where AI chatbots have gained the ability to identify image content and give detailed descriptions in a human-like manner. Furthermore, they can express these descriptions in a natural conversational way.
In this article, I will explore the current state of image description, particularly AI-generated descriptions—the positives, the negatives, and what’s still missing.
State of Image Description on Android:
Before delving into image recognition software, let’s briefly discuss text recognition.
Android offers a range of apps capable of extracting text from images, some of which achieve high accuracy, even recognizing handwritten text. Receiving an image with text is usually not problematic, given the availability of apps like Envision, Lookout, Eye-D, and more, which can read the content.
Recognizing the content of an image, such as the objects it contains, is a more complicated process. Apps like Envision typically provide a general overview sentence about the image. While this approach works in many cases, it may falter with more unconventional images and often lacks in-depth details.
Apps featuring item identification in images often present you with a list of item names without a clear connection between them and without providing information on the spatial arrangement of elements in the photo.
The most promising image description services are those generated by AI. There’s one app designed specifically for the visually impaired, called From Your Eyes, which claims to possess this capability. Unfortunately, despite good intentions and the creators’ passion, the AI model used in the app is, to put it mildly, inadequate. It struggles to accurately describe images and frequently misses crucial details. The app has an image recognition service operated by volunteers, but that’s unrelated to the subject of this article.
Google has recently introduced AI descriptions with the ability to ask follow-up questions in the Lookout app. However, Google has adhered to its tradition and made this feature available in only a few countries, with a promise to expand it to other regions in the future. This feature likely leverages the Bard AI chatbot developed by Google.
Another competitor, OpenAI’s GPT-4, is integrated into a different app for the visually impaired called Be My Eyes. Unfortunately, this feature remains exclusive to iOS, with a promise of future availability on Android.
Nonetheless, GPT-4 can be accessed through other means, such as a Chat GPT Plus subscription or Microsoft’s Bing Chat, and Bard can be opened in your browser. So, AI image descriptions are not entirely absent on Android.
My Evaluation of the Two Services (Bard and GPT-4 via Bing Chat):
I recently tested the image description capabilities of both Google Bard and GPT-4 via Bing Chat. I should mention that I ran the tests on my Windows PC, because I still find it easier to paste images there, and because the focus is on the results of the description service itself, not on how the test is conducted.
Promising Identification Results:
Bard and GPT consistently delivered meaningful responses containing many image details. I conducted tests with four images, none of which included people. Two of the images featured food, while the other two depicted Lego constructions. Both services accurately described most of the details, leading to a good understanding of the image’s main contents, despite some mistakes that I will discuss later.
Both services successfully recognized my nephew’s Lego swimming pool with the Lego figure swimming in it. They also identified the other photo as another Lego creation and correctly labeled some of the food in the meals.
Without a doubt, these services provide far more detailed information than an app like Envision. Moreover, you can ask them follow-up questions.
For example, Envision mistakenly categorized the Lego swimming pool as a diagram. It’s also reasonable to assume that they excel at providing additional content-related information, such as recognizing memes, as images can sometimes be more than just objects. However, I didn’t personally test this aspect.
Both AI services attempt to read the text, such as logos and other textual elements, present in the image in addition to other content. Bing Chat performs better in this regard, but I’ll delve into that in more detail later. It’s valuable to be able to read text and identify other image content seamlessly.
While I’m impressed with the overall performance, these services are still not flawless and encounter some issues.
Hallucinations:

It’s a well-known fact that AI can sometimes hallucinate, describing elements that don’t exist in reality. This became evident during my testing.
Bing, for instance, not only misidentified fried chicken as a burger but also inaccurately claimed that this burger contained lettuce and tomato. Additionally, it incorrectly stated that the fried potatoes had salt on them.
Bard confidently asserted the presence of a plate of soup beside the bowl of food in one of the photos, despite there being nothing of the sort. In another instance, Bing described a pond, complete with ducks and boats, in a photo that contained nothing resembling a pond.
Furthermore, Bard mentioned a sunny day while describing the Lego pool photo, even though the photo was taken indoors and the sun was not visible.
Misidentifications:

We should expect misidentifications from AI services that are still under development. While some degree of misidentification is acceptable, the issue arises when it’s combined with hallucinations and the bots’ confident tone, which I will discuss in the next point.
For example, GPT-4 misidentified a Lego-created car as a Lego building and proceeded to describe it as such. It even fabricated a Lego shop sign on the imaginary building, even though there was no building or Lego shop text present.
In another instance, GPT substituted chicken with mushrooms in a homemade plate of food, despite no mushrooms being visible in the entire photo, not even in the background. Additionally, when presented with a spoon and fork, it added a knife to the set, although it wasn’t there in reality.
Bing Chat also didn’t accurately describe all of the details of the Lego figure. It appeared to be confused by a part of another figure visible in the photo. Interestingly, it didn’t acknowledge this additional part until I asked if there was another figure. It then responded that there was an incomplete one, adding that maybe I had run out of bricks before finishing it.
Bard, on the other hand, failed to recognize prominent objects such as a Coca-Cola can. I even asked whether there was a drink in the photo, and Bard replied that there was none. As for the incomplete Lego figure, Bard didn’t register that a part was missing; it stated that there were two Lego figures in the pool.
The Overly Confident Tone:

Another common issue with AI bots is their tone: they often respond in a convincing, overly confident manner. Both services explicitly state their AI nature and the possibility of providing incorrect information, particularly Bard, where Google emphasizes its early-stage development. Still, the assertive tone the bots employ carries a risk, especially when the information provided is inaccurate. Perhaps they should adopt a more cautious tone to convey that the bot is uncertain about an identified object.
Inability to Identify People, For Now:
Certainly, the decision not to identify and describe people is a justified choice, at least at this stage. Bard explicitly stated, when I showed an image containing people, that it doesn’t identify individuals and declined to provide information about other aspects of the image.
Bing Chat took a different approach. While analyzing an image, it mentioned that faces would be blurred. In the same image where Bard refused to comment, Bing discussed other elements, such as the landscape, but entirely disregarded the people. It did tell me about the fabricated pond and boats mentioned earlier, but it didn’t acknowledge the presence of the people or describe them.
Since I was familiar with all the images I presented, I asked whether there were people in the photos, and the bot confirmed their presence but stated it couldn’t describe them. For someone unfamiliar with the image content, which is often the case when we encounter images, the presence of people might go unnoticed, leading them to believe it’s solely a nature photo.
Exaggeration and Over-Positivity:
In this aspect, GPT clearly stands out. For GPT, every photo is a masterpiece, a work of art. Everything is portrayed as perfect: even if the background is incidental, in GPT’s view it’s the perfect background. Food becomes delicious and expertly prepared, items are arranged exceptionally well, and the atmosphere is consistently delightful.
Of course, the photos I sent were not subpar, but they were ordinary photos taken in very ordinary conditions. I didn’t experiment with a photo I had taken myself to see whether GPT would offer criticism or continue its overly positive trend and portray it as an artistic masterpiece.
I should note here that I’m not sure whether Chat GPT itself has the same issue when describing images, because I’m not subscribed to the service. However, based on my other interactions with Chat GPT, I’m inclined to think it does.
In contrast, according to my tests, Bard maintains a more neutral tone than Bing. It compliments things but doesn’t go overboard with excessive praise.
Differences between Bard and Bing Chat:
Despite their shared strengths and weaknesses, there are still differences between Bard and Bing Chat.
In addition to what was previously discussed, Bing was slower to respond, but it provided more detailed information.
Conversely, Bard responded quickly but with fewer details, occasionally omitting text elements like logos. I am referring to the “describe this image” prompt, although it’s possible to request more details from Bard by asking more questions.
Furthermore, there were instances where one service misidentified items that the other correctly recognized. For example, Bard correctly identified the Lego car, while GPT-4 interpreted it as a building. GPT also detected rice as an ingredient in a meal, which Bard did not. In my experience, GPT-4 via Bing Chat is superior in image recognition, offering more details and accurately recognizing a greater number of items.
Are the Services Exciting in Their Current State?
There’s no need to hesitate when answering this question – absolutely, they are. AI bots and APIs are still evolving, and they are progressing rapidly. The current results are quite impressive for the most part.
For now, it’s essential to approach them with caution. We should make the most of what they offer while acknowledging their limitations; we can rely on their presence but not place complete trust in them. It’s worth noting that asking about the same image a second time may yield a different result or a different tone. I tested this with only one photo using Bing: it retained the original description but provided fewer details, and it asked whether I like building Lego creations. I didn’t test further, because it’s currently not uncommon for AI bots to change their responses.
I hope that the wait for the Be My AI feature of the Be My Eyes app won’t be too long, and that Google will expand the Lookout AI service to users worldwide. Moreover, the door is wide open for other apps to emerge, built on prominent AI bots or other technologies.
The ability to recognize images might eventually lead to real-time camera view recognition. Needless to say, such a capability could be invaluable to a blind person and greatly enhance daily life. While it may be premature to be overly optimistic about this possibility, I believe it’s no longer an impossible dream.
Even in its current state, where image identification occurs after the image is captured, image description is an exciting development with significant potential.