Last updated on 10 December 2024
There is no doubt about how important describing images is for blind users. However, being able to hear descriptions of videos is equally important. Video descriptions can be more challenging, though, as the description service needs to establish a connection between scenes to provide meaningful context. With the rapid advancements in AI and LLMs (large language models), these technologies have stepped into the realm of video analysis. It didn’t take long for initiatives to emerge, aiming to harness this progress.
Currently, there are two Android apps designed for blind users that include video description: Microsoft Seeing AI and PiccyBot. Each app adopts a different approach to video description. This article compares how the two video description features in these apps function, without focusing on the results obtained or the accuracy of their video descriptions.
A Quick Overview of the Apps
Seeing AI, an app developed by Microsoft, offers a suite of features targeted at blind users. These include instant text and document reading, product identification, and image description. Version 1.2 introduced the app’s first iteration of video description, and at the time of writing, this remains the most recent version.
PiccyBot, on the other hand, focuses entirely on image and video descriptions. Initially launched a few months ago with image descriptions only, it quickly expanded to include video descriptions. The current version, as of writing, is version 1.42.
A Comparison Between the Video Description Features of Seeing AI and PiccyBot
Completely Different Approaches
The first notable difference between Seeing AI and PiccyBot’s video descriptions lies in how the descriptions are delivered.
After recognizing the video, Seeing AI plays the video and pauses it to describe the first scene. It then resumes playing for a few seconds before pausing again to describe the next scene. This process continues throughout the video: playing short segments and pausing to provide descriptions until the video ends.
The descriptions are spoken using the TTS engine specified in the app, which relies on a TTS engine already installed on the device. The full set of descriptions cannot be reviewed with the screen reader. Each short description stays on screen, where the screen reader can read it, only until the next scene description replaces it.
The tests indicate that the service does not take the audio of the video into account. It seems to focus solely on analyzing visual frames, without integrating sounds like dialogue or background audio into the analysis, and consequently, these elements are missing from the provided descriptions.
In contrast, PiccyBot adopts a completely different approach. The video is uploaded to the AI model for analysis, and the description is sent back to the user. The user can choose to listen to the description using one of PiccyBot’s supported voices or read it with the screen reader. Unlike Seeing AI, there is no synchronization between the actual video and the provided description.
Once the video is compressed, PiccyBot plays it while it uploads, so the user can verify the video and its duration.
Another key difference is that PiccyBot incorporates video sounds into its analysis. The descriptions include details about dialogue and other sounds present in the video.
Asking Follow-Up Questions
No matter how good a provided video description is, sometimes being able to ask follow-up questions is key to finding what the user needs to know from the video. Even though Seeing AI allows asking additional questions when describing images, this functionality is still not available for videos.
PiccyBot, however, does not differentiate between images and videos, allowing follow-up questions for both. In a recent instance, I was able to ask several questions about a particular character in a video, obtaining details about their scenes and state throughout the video.
Supported Videos
Seeing AI’s video description in its current iteration supports only local .mp4 videos. In contrast, PiccyBot extends its support to online services, specifically YouTube and Instagram videos. During testing with YouTube, sharing the video link with PiccyBot initiated the process. This included downloading the video to the device, compressing it for upload, and then uploading and processing it to generate results. Both downloaded and compressed videos are currently stored on the device.
Capturing Videos
Seeing AI does not offer the ability to capture a video for analysis at the time of writing. It only supports sharing videos from other apps. PiccyBot, on the other hand, allows both sharing existing videos and shooting new ones directly within the app.
AI Models Used and Customization Options
Although the exact AI model used for Seeing AI’s video descriptions is not disclosed, it is reasonable to assume that it utilizes one of OpenAI’s GPT models, given that Seeing AI is a Microsoft-developed app. Customization options are currently limited, with the only adjustable feature being text-to-speech settings, which affect not just video descriptions but other app features that use speech.
In contrast, PiccyBot offers a broader range of customization. Paid users can select from several AI LLMs, including popular ones like GPT-4o, 4o Mini, and Gemini Flash/Pro/Experimental101. Other customizable options include the number of tokens (affecting description detail), video quality, and speech settings. However, PiccyBot does not use offline TTS engines for reading descriptions.
Cost
Seeing AI is completely free to use. On the other hand, most of PiccyBot’s customization features are behind a paywall. Users can opt for a monthly subscription or a one-time lifetime purchase of approximately $20. While video descriptions are available for free, they come with more limited durations and lack access to customization options for the AI model or other settings.
Video Duration and Size
Seeing AI supports videos up to 10 minutes long. When attempting to share a video that surpasses the allowed duration, a message appears stating that the video exceeds the 10-minute limit. While there is no official information about the maximum file size, it is estimated to be between 100 and 150 MB based on user observations.
PiccyBot, however, is more restrictive regarding video duration. While there is no information about the limits in the free version, the Pro version caps video durations at 5 minutes. For videos longer than 5 minutes, the service uploads only the first 5 minutes. According to the developer, when dealing with YouTube videos that include captions, the app attempts to summarize the content based on the captions if the video exceeds the 5-minute limit.
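To make the Pro-version rule above concrete, here is a minimal Python sketch of the decision it describes. It is not PiccyBot's code: the function name, its parameters, and the returned fields are hypothetical, and the article does not say whether the caption-based summary replaces the 5-minute upload or supplements it, so the sketch simply reports both facts.

```python
PRO_LIMIT_SECONDS = 5 * 60  # Pro-version cap reported above


def plan_video(duration_seconds, is_youtube=False, has_captions=False):
    """Hypothetical helper: how a video of this length would be handled."""
    if duration_seconds <= PRO_LIMIT_SECONDS:
        # Within the cap: the whole video is uploaded, captions are not needed.
        return {"upload_seconds": duration_seconds, "use_captions": False}
    # Over the cap: only the first 5 minutes are uploaded. Captions are
    # consulted only for YouTube videos that actually have them.
    return {
        "upload_seconds": PRO_LIMIT_SECONDS,
        "use_captions": is_youtube and has_captions,
    }


print(plan_video(120))
print(plan_video(900, is_youtube=True, has_captions=True))
```

A local 15-minute video would therefore be cut to its first 5 minutes with no caption fallback, since captions are only mentioned for YouTube.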
Regarding file size, there is no official information available for PiccyBot. However, the app compresses videos before uploading. Interestingly, in two instances involving YouTube videos, the compressed files ended up being larger than the original videos rather than smaller.
Number of Videos Per Day
Seeing AI explicitly limits users to 10 video uploads within a 24-hour period. On the other hand, PiccyBot does not mention any specific restrictions on the number of videos that can be processed daily.
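For readers curious how a limit of this kind typically behaves, the following Python sketch models a quota of 10 uploads per 24 hours. It assumes a rolling window, where each upload stops counting 24 hours after it was made. Seeing AI may instead reset at a fixed time of day, and the class and method names here are purely illustrative.

```python
from collections import deque

DAY_SECONDS = 24 * 60 * 60


class DailyQuota:
    """Illustrative rolling-window quota; not Seeing AI's actual logic."""

    def __init__(self, limit=10, window_seconds=DAY_SECONDS):
        self.limit = limit
        self.window = window_seconds
        self.stamps = deque()  # times of the uploads that still count

    def try_upload(self, now):
        # Forget uploads that are at least one full window old.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False  # quota used up for the current window
        self.stamps.append(now)
        return True


quota = DailyQuota()
results = [quota.try_upload(now=i) for i in range(11)]
print(results.count(True), "accepted,", results.count(False), "rejected")
```

Under this assumption, the eleventh upload is refused, and a slot frees up again exactly 24 hours after the first upload.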
Stability and Reliability
Seeing AI requires a strong internet connection with very good upload speeds to function properly. This limitation has prevented me from successfully uploading most videos, even small ones of under 5 MB. The main issue is that Seeing AI does not allow sufficient time for video uploads, often displaying an error message asking the user to retry less than a minute after the upload starts.
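A quick back-of-the-envelope calculation shows why a short upload window hurts slow connections. The Python sketch below assumes a 60-second window, since the error appears in under a minute and the exact value is not documented. It also assumes 1 MB is 8 megabits and ignores protocol overhead. The upload speeds are example values, not measurements.

```python
ASSUMED_WINDOW_SECONDS = 60  # assumption: the real timeout is not documented


def upload_seconds(size_mb, upload_mbps):
    """Ideal upload time: megabytes times 8 bits, divided by megabits per second."""
    return size_mb * 8 / upload_mbps


def fits_in_window(size_mb, upload_mbps, window=ASSUMED_WINDOW_SECONDS):
    return upload_seconds(size_mb, upload_mbps) <= window


for speed in (0.5, 1.0, 5.0):
    seconds = upload_seconds(5, speed)
    verdict = "fits" if fits_in_window(5, speed) else "times out"
    print(f"5 MB at {speed} Mbps: {seconds:.0f} s, {verdict}")
```

With these assumptions, a 5 MB video needs 80 seconds at 0.5 Mbps, so it would fail even though the file is small, while the same file takes 40 seconds at 1 Mbps.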
Processing speed is another area where the app struggles. According to users with decent internet speeds, processing videos can still take a considerable amount of time, even for shorter clips.
In contrast, PiccyBot handles uploads much better because it does not impose strict time limits on the upload process. This makes it a better option for users with slower upload speeds. Additionally, PiccyBot provides real-time updates on the video’s progress, showing whether it is in the compression, uploading, or processing stage. Waiting times can vary based on factors such as internet speed, the selected AI model, and the server’s current traffic. However, occasional server errors should be expected.
Opinion and Final Remarks
Both Seeing AI and PiccyBot’s video description features have their strengths and weaknesses. However, I find myself leaning towards PiccyBot, influenced by my personal experience and the fact that it is developed by an independent developer. Despite limited financial resources, this developer successfully introduced video description on Android before larger companies. Additionally, PiccyBot’s forgiving nature when dealing with slower internet speeds is a significant advantage.
Seeing AI’s restrictive upload time is a major drawback. The short window provided for uploading videos can be frustrating, particularly in areas with slow internet connections. I reported this issue and hope it will be addressed soon. If left unresolved, this limitation could become a barrier, preventing many users from benefiting from Seeing AI’s video description feature. This undermines inclusive and equal access, especially given the disparities in internet speeds across different regions.
The different implementations of video description between the two apps offer users flexibility, as each approach can be more suitable for certain types of videos. Seeing AI’s syncing of descriptions with video playback is impressive and less likely to omit details. On the other hand, PiccyBot’s inclusion of video sounds and the ability to ask follow-up questions make its descriptions richer in certain contexts. Moreover, PiccyBot’s integration with popular AI services allows users to compare descriptions from different AI models. By simply changing the AI model in settings, users can reprocess the video without re-uploading it, as long as the session remains active.
When using either app for video descriptions, it is essential to keep in mind that the technology is still evolving. Describing videos is inherently more complex than describing single images.
One area where Seeing AI excels is its simple and user-friendly interface. This simplicity stems partly from the absence of customization options, and it stands out all the more next to PiccyBot’s occasionally cluttered interface, which can be confusing. However, PiccyBot is frequently updated, and I remain optimistic that the developer will address these UI quirks in future releases. Server issues and occasional service interruptions with PiccyBot are also worth noting, but these challenges are understandable given the developer’s limited resources. Even large companies sometimes struggle with server reliability and expenses.
This comparison highlights the offerings of the two main video description services available on Android for blind users. Both services are beneficial in their own ways and have room for improvement. The presence of more services in this space would further benefit users.
Investing in video description services and apps could be the key to keeping competition active and to satisfying developers and users alike: developers get an incentive to create, and users get more choices to select from or, even better, to use side by side to achieve the most accurate results.
Over time, this article may be updated or complemented by additional articles as these services evolve. I remain hopeful for continued advancements and improvements in this area, bringing positive developments for all.
I am an assistive technology specialist in Yakima, Washington, who has been blind since birth. I’m also a co-editor of the Washington Council of the Blind Newsline quarterly magazine. The audio version is now available by email, in large print, and now nationally through BARD. I am writing to request permission to reprint your excellent article on video description for our winter issue. The only immediate way I could see to contact you was by leaving a comment, smile. I would include your author bio and your request for donations to support the work, along with the direct link. Please get in touch if you are willing. I will mask the address, but you can write to me at
TheWCBNewsline at gmail.
Respectfully,
Reginald George.
Assistive technology specialist
Washington Department of Services for the Blind