Serendip is an independent site partnering with faculty at multiple colleges and universities around the world. Happy exploring!

Midterm Project (finally): an Automatic Captioning App

Charlie's picture

I have been very, very slowly catching up on my work after dealing with a new onset of symptoms this semester, but I finally have my midterm project finished! I have decided to make this a semester-long project and continue working on it during finals, but here is an update on my project, subScript.

Critical Disability Studies Midterm Project: subScript



subScript is an app that will automatically generate captions for real-life conversations using speech-to-text and machine learning software. The app will go beyond just providing a literal text representation of speech, however, and will try to capture qualities of the sound that are lost in a typical speech-to-text context. Inspired by work by Deaf artists like Christine Sun Kim, the app will relate these qualities through features like providing text in different colors to convey pitch, different sizes to convey loudness, and possibly even replicating the sounds' vibrations by vibrating the phone.
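As a rough illustration of the color and size idea, here is a minimal sketch in Python of how a word's loudness and pitch might be turned into a caption style. Everything in it is an assumption made for the example: the function names, the decibel and pitch ranges, the font sizes, and the choice of blue for low pitch and red for high pitch are all hypothetical, not decisions the app has made.

```python
import colorsys

# Hypothetical ranges, chosen only for illustration.
MIN_DB, MAX_DB = -50.0, 0.0    # quiet .. loud, in dB relative to full scale
MIN_HZ, MAX_HZ = 80.0, 400.0   # rough range of speaking pitch
MIN_PT, MAX_PT = 14.0, 40.0    # smallest .. largest caption font size


def _scale(value, low, high):
    """Clamp value to [low, high] and rescale it to the range 0.0 .. 1.0."""
    clamped = max(low, min(high, value))
    return (clamped - low) / (high - low)


def loudness_to_font_size(level_db):
    """Map a loudness level in dB to a font size in points."""
    return MIN_PT + _scale(level_db, MIN_DB, MAX_DB) * (MAX_PT - MIN_PT)


def pitch_to_color(pitch_hz):
    """Map a pitch in Hz to a hex color: low pitch is blue, high pitch is red."""
    # On the HSV color wheel, hue 2/3 is blue and hue 0 is red.
    hue = (1.0 - _scale(pitch_hz, MIN_HZ, MAX_HZ)) * (2.0 / 3.0)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )


def style_word(word, level_db, pitch_hz):
    """Bundle a word with the caption style derived from its sound."""
    return {
        "text": word,
        "font_size_pt": round(loudness_to_font_size(level_db), 1),
        "color": pitch_to_color(pitch_hz),
    }
```

A call such as `style_word("hello", -25.0, 400.0)` would give a mid-sized red word. Since color alone is not accessible to everyone, a real version would probably let users pick their own palette or switch the color cue off.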


This is the idea I have for the app, but for the scope of this course I will likely just be making the most basic form of it, with pure speech-to-text capabilities.


Research & inspiration: 

Here is some more information on the research I have done into accessible app design and on Christine Sun Kim's art with subtitles, which is inspiring the design for the app.


The idea for this app originally came from a desire to have an app which would allow for easy, real-time translation between spoken and written word for conversational use. As a hard-of-hearing person, subtitles and other assistive communication technology have been at the forefront of my mind for a long time. Though I still have enough hearing to get by alright in hearing communities, I am always imagining ways I could reduce the strain of listening fatigue that I and other HOH and d/Deaf people experience. As a computer scientist, I am frequently drawn back to the question of how technology and computer science can be used to increase accessibility. At the intersection of my identity and my academics is natural language processing and speech-to-text technologies, and I am hoping to explore that area further.


At the same time, I don't want to stop at verbatim translation. There are many more nuances to language and sound beyond words, and I want to capture some of those qualities in this app. This aspect is inspired by Christine Sun Kim's work in Closer Captions, in which she explores the ways that movie and TV captions could be more thorough and creative in their approach to capturing sound, and expresses her frustrations with inadequate captioning.


I also want to make sure the app is accessible overall, and so I have been researching accessible design in computer science to help with that. 






  • The main page will be easy to navigate with one clearly indicated center button that begins the transcription.
  • The transcription text will be written in a large, high-contrast, and easily legible (sans serif) font.
  • The side menus (settings, record) will be accessed either by swiping or tilting the phone, in addition to button options.
  • Finally, the app will be highly customizable via the settings menu. Users can change text and background color, disable/enable tilting, and more.
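Because users will be able to pick their own text and background colors, the settings menu could warn them when a combination is hard to read. The sketch below, in Python, computes the contrast ratio defined in the Web Content Accessibility Guidelines (WCAG) and compares it to the 4.5:1 minimum that WCAG recommends for normal-size text. The function names are hypothetical, and whether the app would use this particular threshold is an assumption.

```python
def _linearize(channel_8bit):
    """Convert an 8-bit sRGB channel to its linear-light value (WCAG formula)."""
    c = channel_8bit / 255.0
    if c <= 0.03928:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a color given as '#rrggbb' (0.0 black, 1.0 white)."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground, background):
    """WCAG contrast ratio between two colors, from 1.0 (same) to about 21.0."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def is_readable(foreground, background, minimum=4.5):
    """True if the color pair meets the given minimum contrast ratio."""
    return contrast_ratio(foreground, background) >= minimum
```

For example, `is_readable("#000000", "#ffffff")` passes, while two similar grays such as `"#777777"` on `"#888888"` do not. The app could use a check like this to show a gentle warning rather than block the choice, so the settings stay fully customizable.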


More information about machine learning and the future of the app:

The app will be made using TensorFlow Lite, a machine learning framework for running models on mobile devices, together with a pre-built, pre-trained speech-to-text model. Since machine learning models take an extensive amount of data to train (in this case, sound files of people talking), using a pre-trained model is both more cost-effective and faster. In the future, adding features like different colors and sizes to represent pitch and volume, as well as the sound vibration replication, will likely require me to retrain the speech-to-text/sound-capturing machine learning model. That way, it will be trained to distinguish voice pitch, volume, and vibrations as well as translate speech to text. For that reason, it is a much longer-term goal that I hope to add to this app in the future. I would also work to keep the app up to date with accessibility features and debug as needed.
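Before any retraining, it may be possible to get a rough baseline for pitch and volume with classical signal processing alongside the speech-to-text model. The Python sketch below is that swapped-in technique, not the machine learning approach described above: it measures loudness as a root-mean-square level in decibels and estimates pitch from the peak of the signal's autocorrelation. The function names, the 80 to 400 Hz search range, and the synthetic test tone are all assumptions for illustration, and real speech is much noisier than a pure tone.

```python
import math


def synth_tone(freq_hz, sample_rate, num_samples, amplitude=0.5):
    """Generate a pure sine tone, used here only as a stand-in for real audio."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        for n in range(num_samples)
    ]


def rms_level_db(samples):
    """Loudness of samples (floats in -1.0 .. 1.0) in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def estimate_pitch_hz(samples, sample_rate, min_hz=80.0, max_hz=400.0):
    """Estimate pitch as the lag with the largest autocorrelation.

    The sum is left unnormalized, which slightly favors shorter lags and so
    helps pick the true period over its multiples. Returns None for silence.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None
```

Running both functions on a short window of audio, for example `synth_tone(200.0, 8000, 800)`, gives one loudness value and one pitch value per window, which is the kind of input a caption styling step would need.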


Thanks for reading!



Kaitlin_Lara's picture

Hello Charlie, 

I really love this future app idea. In most cases, you still need to add subtitles to a video manually, since most auto-generated subtitles aren't very accurate right now. I have recently seen people use closed captions on their videos, but not enough. For instance, TikTok has a closed caption option, but it is still flawed. There are many instances where people in the comments remark on mistakes in the captions. Another major flaw is the language barrier. For example, it is common for a person to code-switch between two or more languages in their video, but the captions never capture it. I do use closed captions because sometimes I cannot hear the audio clearly or cannot turn my volume up loud enough to hear. Also, on the topic of closed captions in different languages, I think it would be amazing and helpful if apps like TikTok offered the option to choose the language you want the closed captions in, as YouTube does. However, there are still some flaws in YouTube's subtitle features. The subtitles will not capture someone's dialect or accent. In relation to the Spanish language, most subtitles or closed captions are based on the Spanish of Spain. Although Spanish is the language spoken in both Latin American countries and Spain, there are regional dialects and differences. Therefore, I believe it is crucial for closed-caption software to take this into account.

I really appreciate that you mention that "the app will relate these qualities through features like providing text in different colors to convey pitch, different sizes to convey loudness, and possibly even replicating the sounds' vibrations by vibrating the phone." This is such a great idea because text in different sizes provides great context. I also really like the idea of replicating the sounds' vibrations by vibrating the phone. I believe this is very possible. It is a further step toward making closed captioning, and captioning in general, more inclusive.