Last updated on 25 August 2024
The field of AI image recognition and description is demonstrating AI’s strength in transforming the lives of blind people, particularly with the growing number of services and apps that use AI image models for the convenience of blind users. Recently, a new image description app called PiccyBot was introduced. It is not just another description service added to the market; it stands out with its noticeable potential and ambitious approach. Let’s take a look at the PiccyBot app, how to use it, what makes it worth attention, and some observations and remarks.
What Sets PiccyBot Apart from the Competition
Video Analysis and Description
The most important aspect of PiccyBot is its video description feature. As of this writing, no other service offers this capability. With PiccyBot, you can either record a video or share one with the app, and it will describe it for you. The maximum video length may vary between the free and Pro versions, and you can tell whether any part has been trimmed because the video plays automatically while it is processed. You can specify the amount of detail in the settings by choosing high, medium, or low.
Using Voice Responses
PiccyBot is designed to provide spoken responses instead of relying on text responses like other known services. It reads the responses using human-like voices. In the Pro version, users can change the voice and the speaking rate. Additionally, users can review the response with a screen reader, as the response is also displayed on the screen.
Personality
PiccyBot tries to take description to a different level by adding a sort of personality to the responses and how the voice reads them. This is evident in the different tones that the user can select, as each speaker has a specific character, such as serious, expressive, caring, happy, and more. The most prominent aspect of this personality is in the responses themselves, where the bot attempts to engage in the conversation like a human. This is shown through expressions like “aha,” “oh let’s see,” and engaging questions such as “beautiful, isn’t it?” as well as intros to the responses like “let’s see what is included in the image…”
After testing this personality feature, I suspect you will either love it or hate it. Personality is on by default, and there is no way to disable it unless you are using the Pro version.
Utilizing Not One but Seven AI Models
With most image description services, you typically cannot choose the AI model you want to use, and some services don’t even disclose the model they utilize. PiccyBot’s approach is different: it puts seven AI description models at your disposal, allowing you to switch between them easily.
It is worth noting, however, that selecting the model is only possible in the Pro version. There is also an observation related to GPT-4o that I will discuss later in the article.
The Pro Subscription
Although PiccyBot has a free version, it is limited in many aspects. Users of the free version cannot choose the AI model, turn off the personality feature, change the speaker or speech rate, or share the audio responses as MP3 files. Additionally, the free version is ad-supported. However, I didn’t find the ads intrusive during the short time I spent using the free version before switching to Pro. To truly take advantage of PiccyBot’s capabilities, purchasing the Pro version is a must, in my opinion. You can opt for a monthly subscription or a one-time lifetime payment.
UI and Usage
PiccyBot app’s UI is straightforward and simple to use. The main elements include the “What’s in this image?” prompt, the “Camera” button to take a picture or choose a photo, the “Video” button to take or choose a video, and the settings button that takes you to the app’s settings. You can explore the elements by swiping, with the image and video buttons located near the bottom of the screen, and the settings button at the top.
When you tap the camera or video buttons, you are presented with the choice to “Take Photo” or “Choose Photo,” and “Take Video” or “Choose Video,” respectively. When taking a photo or video, you will be using your phone’s camera interface. This means that if your phone’s camera supports features like face detection, you will hear when there are faces in the view. After taking the photo or video, you will be asked if you want to try again or press OK to start the recognition.
If you choose to select a photo or video instead of capturing one, you are presented with the apps you can share the photo or video from, including file managers, making the selection process easy. You can also share a photo or video to PiccyBot from other apps without opening PiccyBot first.
Once you have selected or captured the photo or video, the recognition starts, accompanied by an ongoing tone, which you can disable in the settings if you prefer. During the process, you can use your screen reader to find out the stage of recognition, such as “uploading” or “processing.” Note that the screen reader may sometimes not read the state completely, announcing only a few letters instead of the entire word.
When recognition is done, you will immediately hear the response. The same response can be navigated using your screen reader. You can disable spoken responses by selecting “none” in the app’s voice selection setting.
After hearing the response, a new “Microphone” button will appear, allowing you to ask a specific question about the recognized image or video. The camera, video, and settings buttons remain on the screen. After asking your question, it will be processed with the same tone playing, and the response will be read out loud to you. During testing, I found that only the last response is displayed on the screen, and there is no way to see previous history or other messages in the conversation.
Currently, there is no text box to type your question if you prefer typing. As a workaround, once the initial image recognition is done, you can clear the “What’s in this image?” prompt and type your question there instead.
After the image is recognized and the response is given, a share button appears, allowing you to share the actual image or save it using a file manager.
The settings section allows you to change the AI model, the speaker and its rate of speech, the language, the video recognition detail, and to toggle the personality feature.
The settings also contain the “Share Audio” option, which shares the most recent response as an MP3 file. Having this button in settings is strange, as users expect it to be next to the other share button, not inside settings.
When choosing speakers, you can use the play button to preview the voices before selecting one. After changing a setting like the voice or the recognition model and leaving the settings, the recognition starts again automatically, which is a welcome feature.
The “length” setting, a part of the model controls, specifies the number of tokens in the response, impacting how detailed the response can be. The length, which can be set between 10 and 100, can be increased or decreased using the “Increase length” and “Decrease length” buttons. The number of characters a token represents depends on the AI model used.
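As a rough illustration of what the 10–100 token range means in practice (the four-characters-per-token figure below is a common rule of thumb for English text with GPT-style tokenizers, not something the app documents, and the real ratio varies by model as noted above):

```python
# Rough sketch: estimating how much response text a token budget allows.
# Assumption: ~4 characters per token, a common rule of thumb for English
# text; the actual ratio depends on the AI model's tokenizer.

def approx_chars(tokens: int, chars_per_token: float = 4.0) -> int:
    """Estimate the character budget implied by a token limit."""
    return round(tokens * chars_per_token)

for length in (10, 50, 100):
    print(f"length={length} tokens -> ~{approx_chars(length)} characters")
```

So the minimum setting of 10 allows only a sentence fragment, while 100 permits a short paragraph, which matches how the detail of responses changes with this setting.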
Since the free version’s customization is very limited, you will find that most of the settings cannot be modified if you are using the free version.
Testing and Observations
Supporting seven AI models instead of just one gives the app a significant advantage during image recognition. This allows for comparisons between services and offers multiple perspectives on the image or video. It’s easy to change the model from settings and then recognize the same image or video again. Note that all my tests were conducted using English as the recognition language.
Video Description: The Game Changer
While there is no shortage of image description services, video description is the new player here. My testing showed that the video recognition works effectively, providing descriptions that depict the details of the video. These descriptions are not isolated elements or frames but are narrated in a style that links the details together throughout the video.
This doesn’t mean that video description is perfect, though, as it is easy to confuse the AI models, especially when there are a lot of details. My video testing was conducted entirely with GPT and Gemini, and both showed confusion in certain aspects. For example, Gemini assumed that words spoken by a child who was not visible in the video frame were from an old lady who appeared in the video.
Issues with video description cannot be blamed on the app itself; they are related to the AI models used. With more progress in the models themselves, video results are expected to become more accurate.
The overall experience with video description shows that the feature is not just a gimmick. The number of details and the descriptions are useful, but as always with AI, you should watch out for hallucinations and detail confusions.
When trying to describe one video, the app was stuck in the compressing state for more than two minutes, and retesting the same video several times produced the same outcome. However, after I sent the same video to myself on WhatsApp and then shared it to PiccyBot, it was recognized with no issues. The video was trimmed using AudioLab, and while it’s not the only video trimmed with this tool, it was the only one with which I faced this issue.
General Observations
GPT-4o: An Unsolved Mystery
When I selected GPT-4o as my model of choice, I expected great outcomes, knowing it is the most advanced GPT available and based on my previous experiences with GPT-based image description services like the “Be My Eyes” Be My AI feature. However, the results were opposite to my expectations. During most of my tests, including videos and photos, the results were not impressive. Judgmental statements were prevalent, details were lacking, and misidentifications were prominent.
In a video where an old lady was squeezing oranges and then giving a kid a cup of orange juice, the model missed many details and invented a completely fake conversation between the lady and the kid, although the lady didn’t say a word during the video. It also assumed that she was the kid’s mother, an assumption made without any proof. It also gave unnecessary comments about the essence and environment of the video, such as love and care.
In images, things were even worse. The President of the United States and the President of Russia were not identified, and when asked about the type of plant, which was the focus of an image, it identified it wrongly twice. The plant, by the way, is the popular Gardenia. “Be My AI,” which uses GPT (although I am unsure if it uses GPT-4o), showed much better results. Gardenia was easily identified, and as I showed in a previous test, it could identify the presidents without any trouble.
These observations baffled me. A clarification from the PiccyBot developer would be welcome, and this part will be updated if more information is received.
Update: The PiccyBot developer has responded, confirming that the app uses the GPT-4o model. He suggested that the superior descriptions provided by Be My AI might result from a closer collaboration between the Be My Eyes team and OpenAI, potentially giving them access to more advanced GPT capabilities not available to other developers. Additionally, he noted that the prompts used could impact the quality of the descriptions generated. He assured that he is continuously refining the prompts and improving the app to deliver the best possible descriptions.
Gemini: The King of the Show
Contrary to the results with GPT-4o, Gemini, to my surprise, gave great results. Using both the Flash and Pro models, it changed some of my previous assumptions about Google’s AI models.
In videos, it excelled at descriptions and provided detailed information. It correctly identified many of the words the kid spoke during the video, and although they were in Arabic, it rendered them in English text. It even captured some playful things the kid said. In the same video, though, it made a few mistakes, such as mistaking a flower that was not part of the kid’s clothes for a drawing on his shirt, and attributing a word spoken by another kid to the old lady.
In the image of the presidents, Gemini identified them easily, and the Flash model went further to provide details about their terms in office and what those times were known for. It is surprising how the same Gemini that refused several times to answer simple questions when I used it as an assistant (like “Who is the President of the United States?”) acted differently with image descriptions.
Gemini also identified Gardenia and provided evidence supporting its identification. When presented with an image of an Arabic document, it translated the text but mentioned that it couldn’t display the original Arabic text.
Generally, the Gemini models proved to be the best performers, with fewer hallucinations and more accurate, successful recognitions.
Other Models and Additional Observations
I have to admit that my main focus was on GPT and Gemini. A few tests with the other models showed that most of them are not up to the competition, with some having a strong tendency to judge and comment on the scene and what it means. They were generally less detailed with faces and people, and one of them cited ethical reasons for refusing to identify the presidents when I asked.
Anyway, the models are available, and the aim of the article is not to compare them all. It is up to the user to utilize the model they prefer.
I am unsure which model the free version uses, but my testing of it didn’t impress me. Neither the personality nor the judgmental (mainly positive) comments helped. The tone and expressions used are very exaggerated, which is not how humans typically talk. This, in my opinion, made the experience less enjoyable. However, this is subjective, and I have heard from a few people that the personality doesn’t bother them.
A serious issue during free-version testing was the confidently wrong responses. When I asked the service to identify the types of some toy cars, it confidently gave me wrong identifications, using expressions like “without a doubt.” I couldn’t retest the same photo after upgrading to the Pro version because it was captured while the kid was playing with the cars, and replicating the same view wasn’t possible. Still, I suspect that turning off the personality feature and using other models would have yielded different outcomes.
Another example came from a random photo of a room. The default free identification told me how organized the room was, although this was not the case; there was nothing extraordinary in the photo, aside from my poor photo-capturing skills and the not-very-organized state of the room itself.
If you are planning to use PiccyBot’s free version, you should be aware of its limitations and always consider comparing the results with other services.
It is important to remind the reader that AI services are not consistent. A certain detail might be correctly described once and then completely missed when recognizing the same image or video for a second time. This inconsistency is not specific to PiccyBot; rather, it is a general issue with image description models.
Asking Follow-up Questions
Although there is no direct way to ask about a photo using text prompts after it is recognized, having the ability to ask about certain details of the image is beneficial. The voice recognition feature is very reliable, as I didn’t encounter any problems when asking my questions. However, I’m still uncertain whether each question is treated separately or whether all the questions and answers are considered part of one ongoing conversation.
During testing, I asked a general question unrelated to the recognized image and received a response stating that the answer to my question is not in the image. While this outcome is completely acceptable and doesn’t indicate any issues with the services or the app, it’s worth noting for informational purposes.
Final Remarks
PiccyBot enters the market with ambitious goals and true capabilities. Being the only app currently offering video description and taking advantage of multiple image recognition AI models, as well as voice responses and the personality features, PiccyBot provides significant value, distinguishing itself from other image description apps.
While the true strength of the app lies in its Pro version, having a free version is appreciated. Moreover, the Pro version is reasonably priced, whether for the monthly or lifetime subscription.
As the app is still new, there is room for improvements and refinements in future versions, such as giving users the ability to stop spoken responses if desired and making it easier to ask follow-up text questions with the whole conversation displayed.
While the waiting time may be slightly longer at times, this should be expected and shouldn’t be held against the app.
Despite not encountering failed recognitions often, it would be beneficial to include a retry button for such situations. This would allow users to quickly resend the image or prompt.
The ability to share audio responses and the actual images is a welcome addition. However, I would recommend relocating the audio-sharing option from the settings to the main screen for better intuitiveness.
Like many users, I am eagerly following PiccyBot’s developments with enthusiasm and belief in its potential. PiccyBot serves as another proof of the crucial role skillful developers can play in the lives of blind people, transforming smartphones and tablets into essential tools for enhancing their independence and combating barriers associated with sight loss.

I have been using PiccyBot since the day it was made available on the Play Store. However, I only subscribed to the Pro version after the video description feature was added earlier this week. I’ll admit, my primary usage has been with GPT-4o, but the results have been quite promising! The video descriptions in particular have impressed me to no end so far. Okay, sure, the results are not always 100% accurate, but the fact that as a blind/visually impaired person I can actually get a more than decent idea of what’s taking place in a video is spectacular. I’ll need to try out the other AI services soon, but PiccyBot is hands down the app of the year. The developer and I have interacted a few times on Facebook, and I definitely got the impression that he’s always open to receiving feedback, as I have reported a couple of bugs over the weeks and they’ve all been fixed in subsequent updates.
The GPT-4o descriptions are not bad, but they didn’t meet my expectations. In fact, the consistent issues were the main reason for dedicating a section of the review to discussing unreliable descriptions when using GPT-4o, especially since comparing the same images with Be My AI showed how much better GPT was performing there. Maybe you are luckier than me in this. I suggest you do some comparisons using Gemini Pro. If you can, then report back with your findings; it would be appreciated. Indeed, the PiccyBot app has great potential, especially if it keeps up with advancements in AI models. Day by day, AI is showing significant progress in image recognition and description, and I hope that the current limitations and confusions will be minimized soon.
I don’t know why the free version is so slow. I don’t know if buying the premium version will change anything; moreover, the app is almost unusable even with the free version. It cannot even describe images, only videos.
There was a server issue yesterday. Have you tried to share an image recently?