inproceedings by Saad Hassan
This paper describes the design of a Mobile Assisted Second Language Learning Application (MASLL), Kahaniyan, created to assist non-native primary school children in learning Urdu. We explore the use of gamification to assist language learning within the context of interactive storytelling. The final design presented in this paper demonstrates how psychological and linguistic aspects, coupled with contextual task analysis, can be used to create a second language learning tool. The study also reports the results of a user study and evaluation of the application conducted with 32 primary school students. Our results show a positive influence on learning outcomes, with findings that hold great significance for future work on designing MASLLs for languages written in Arabic or Persian script.
Advances in sign-language recognition technology can enable users of American Sign Language (ASL) dictionaries to search for a sign whose meaning is unknown by submitting a video of themselves performing the sign they had encountered, based on their memory of how it appeared. However, the relationship between the performance of sign-recognition technology and user satisfaction with such a search interaction is unknown. In two Wizard-of-Oz experimental studies, we found that, in addition to the position of the desired word in a list of results, the similarity of the other words in the results list also affected user satisfaction.

We have collected a new dataset consisting of color and depth videos of fluent American Sign Language (ASL) signers performing sequences of 100 ASL signs, recorded with a Kinect v2 sensor. This directed dataset had originally been collected as part of an ongoing collaborative project, to aid in the development of a sign-recognition system for identifying occurrences of these 100 signs in video. The set of words consists of vocabulary items that would commonly be learned in a first-year ASL course offered at a university, although the specific set of signs selected for inclusion in the dataset had been motivated by project-related factors. Given increasing interest among sign-recognition and other computer-vision researchers in red-green-blue-depth (RGBD) video, we release this dataset for use by the research community. In addition to the RGB video files, we share depth and HD face data as well as additional features of face, hands, and body produced through post-processing of this data.

Prior work has revealed that Deaf and Hard of Hearing (DHH) viewers are concerned about captions occluding other onscreen content, e.g., text or faces, especially for live television programming, for which captions are generally not manually placed. To support evaluation or placement of captions for several genres of live television, empirical evidence is needed on how DHH viewers prioritize onscreen information, and whether this varies by genre. Nineteen DHH participants rated the importance of various onscreen content regions across six genres: News, Interviews, Emergency Announcements, Political Debates, Weather News, and Sports. The importance of content regions varied significantly across several genres, motivating genre-specific caption placement. We also demonstrate how the dataset informs the creation of importance weights for a metric that predicts the severity of captions occluding onscreen content. This metric correlated significantly better with 23 DHH participants' judgments of caption quality than a metric with uniform importance weights for content regions.
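As a rough illustration of how such importance weights could feed an occlusion-severity metric, the Python sketch below scales each region's occluded fraction by an importance weight and sums the results; the region names, weights, and fractions are invented for the example and are not values from the study.

```python
# A minimal sketch (region names and numbers are illustrative) of an
# importance-weighted occlusion metric: each region's occluded fraction is
# scaled by a genre-specific importance weight and summed into one score.
importance_weights = {          # e.g. weights derived from viewer ratings
    "speaker_face": 0.9,
    "topic_banner": 0.7,
    "scrolling_news": 0.4,
}
occluded_fraction = {           # fraction of each region covered by captions
    "speaker_face": 0.05,
    "topic_banner": 0.50,
    "scrolling_news": 0.10,
}

severity = sum(importance_weights[r] * occluded_fraction[r]
               for r in importance_weights)
print(f"occlusion severity: {severity:.3f}")
```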

While the availability of captioned television programming has increased, the quality of this captioning is not always acceptable to Deaf and Hard of Hearing (DHH) viewers, especially for live or unscripted content broadcast from local television stations. Although some current caption metrics focus on textual accuracy (comparing caption text with an accurate transcription of what was spoken), other properties may affect DHH viewers' judgments of caption quality. In fact, U.S. regulatory guidance on caption quality standards includes issues relating to how the placement of captions may occlude other video content. To this end, we conducted an empirical study with 29 DHH participants to investigate the effect on users' judgments of caption quality and their enjoyment of the video when captions overlap with an onscreen speaker's eyes or mouth, or when captions overlap with onscreen text. We observed significantly more negative user-response scores in the case of such overlap. Understanding the relationship between these occlusion features and DHH viewers' judgments of the quality of captioned video will inform future work toward the creation of caption evaluation metrics, to help ensure the accessibility of captioned television or video.
Much of the world's population experiences some form of disability during their lifetime. Caution must be exercised while designing natural language processing (NLP) systems to prevent them from inadvertently perpetuating ableist bias against people with disabilities, i.e., prejudice that favors those with typical abilities. We report on various analyses based on word predictions of a large-scale BERT language model. Statistically significant results demonstrate that people with disabilities can be disadvantaged. Findings also explore overlapping forms of discrimination related to interconnected gender and race identities.
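For readers unfamiliar with this kind of probe, the sketch below shows one common way to elicit word predictions from a BERT model for sentences that do or do not mention disability. The prompt templates are illustrative assumptions, not the prompts used in the paper.

```python
# A minimal sketch (not the authors' code) of probing a BERT model's word
# predictions for sentences that do or do not mention disability.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "The person is [MASK].",                       # no disability mention
    "The deaf person is [MASK].",                  # disability mention
    "The person who uses a wheelchair is [MASK].",
]

for template in templates:
    predictions = fill_mask(template, top_k=5)
    words = [p["token_str"] for p in predictions]
    print(f"{template} -> {words}")
```

Comparing the predicted words (or their sentiment) across templates is one way to surface the kind of bias the analysis describes.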

Television captions blocking visual information causes dissatisfaction among Deaf and Hard of Hearing (DHH) viewers, yet existing caption evaluation metrics do not consider occlusion. To create such a metric, DHH participants in a recent study imagined how bad it would be if captions blocked various on-screen text or visual content. To gather more ecologically valid data for creating an improved metric, we asked 24 DHH participants to give subjective judgments of caption quality after actually watching videos, and a regression analysis revealed which on-screen content's occlusion related to users' judgments. For several video genres, a metric based on our new dataset outperformed the prior state-of-the-art metric, which had been based on that earlier study, at predicting the severity of captions occluding content during videos. We contribute empirical findings for improving DHH viewers' experience, guiding the placement of captions to minimize occlusions, and automated evaluation of captioning quality in television broadcasts.
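A minimal sketch of the kind of regression underlying such a metric: viewers' quality ratings are regressed on the fraction of each on-screen region occluded by captions, and the fitted coefficients suggest per-region importance weights. The region names and numbers below are invented for illustration, not data from the study.

```python
# A minimal sketch (hypothetical features and ratings, not the study's data)
# of regressing caption-quality ratings on per-region occlusion fractions.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: fraction of a region occluded by captions in one video clip.
# Columns (hypothetical): speaker's face, on-screen text, other graphics.
X = np.array([
    [0.00, 0.10, 0.05],
    [0.30, 0.00, 0.20],
    [0.10, 0.40, 0.00],
    [0.25, 0.25, 0.10],
])
# Mean subjective quality rating given by viewers for each clip.
y = np.array([4.5, 3.2, 2.8, 3.0])

model = LinearRegression().fit(X, y)
# More negative coefficients indicate regions whose occlusion hurts ratings
# most; their magnitudes can serve as weights in an occlusion metric.
print(dict(zip(["face", "onscreen_text", "graphics"], model.coef_)))
```
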
Advancements in AI will soon enable tools for providing automatic feedback to American Sign Language (ASL) learners on some aspects of their signing, but there is a need to understand learners' preferences for submitting videos and receiving feedback. Ten participants in our study were asked to record a few sentences in ASL using software we designed, and we provided manually curated feedback on one sentence in a manner that simulates the output of a future automatic feedback system. Participants responded to interview questions and a questionnaire eliciting their impressions of the prototype. Our initial findings provide guidance to future designers of automatic feedback systems for ASL learners.

Despite some prior research and commercial systems, if someone sees an unfamiliar American Sign Language (ASL) word and wishes to look up its meaning in a dictionary, this remains a difficult task. There is no standard label a user can type to search for a sign, and formulating a query based on linguistic properties is challenging for students learning ASL. Advances in sign-language recognition technology will soon enable the design of a search system for ASL word look-up in dictionaries, by allowing users to generate a query by submitting a video of themselves performing the word they believe they encountered somewhere. Users would then view a results list of video clips or animations to seek the desired word. In this research, we investigate the usability of such a proposed system, a webcam-based ASL dictionary, using a Wizard-of-Oz prototype, and we have enhanced the design so that it can support sign language word look-up even when the performance of the underlying sign-recognition technology is low. We have also investigated the requirements of students learning ASL in regard to how results should be displayed and how a system could enable them to filter the results of the initial query, to aid in their search for a desired word. We compared users' satisfaction when using a system with or without post-query filtering capabilities. We discuss our upcoming study to investigate users' experience with a working prototype based on actual sign-recognition technology that is being designed. Finally, we discuss extensions of this work to the context of users searching datasets of videos of other human movements, e.g., dance moves, or when searching for words in other languages.

Searching for the meaning of an unfamiliar sign-language word in a dictionary is difficult for learners, but emerging sign-recognition technology will soon enable users to search by submitting a video of themselves performing the word they recall. However, sign-recognition technology is imperfect, and users may need to search through a long list of possible results when seeking a desired result. To speed this search, we present a hybrid-search approach, in which users begin with a video-based query and then filter the search results by linguistic properties, e.g., handshape. We interviewed 32 ASL learners about their preferences for the content and appearance of the search-results page and filtering criteria. A between-subjects experiment with 20 ASL learners revealed that our hybrid search system outperformed a video-based search system along multiple satisfaction and performance metrics. Our findings provide guidance for designers of video-based sign-language dictionary search systems, with implications for other search scenarios.

Deaf and Hard of Hearing (DHH) individuals regularly rely on captioning while watching live TV. Live TV captioning is evaluated by regulatory agencies using various caption evaluation metrics. However, these metrics are often not informed by the preferences of DHH users or by how meaningful the captions are. There is a need for caption evaluation metrics that take the relative importance of words in a transcript into account. We conducted a correlation analysis between two types of word embeddings and human-annotated word-importance scores in an existing corpus. We found that normalized contextualized word embeddings generated using BERT correlated better with manually annotated importance scores than word2vec-based word embeddings. We make available a pairing of word embeddings and their human-annotated importance scores. We also provide proof-of-concept utility by training word importance models, achieving an F1-score of 0.57 on a 6-class word importance classification task.
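One plausible reading of the embedding-versus-importance analysis is sketched below: take the L2 norm of each word's contextualized BERT embedding and compute its Spearman correlation with the human-annotated importance scores. The sentence, annotation values, and use of the norm as the scalar summary are assumptions for illustration; the paper's exact procedure may differ.

```python
# A minimal sketch (an assumed reading of the analysis, not the authors' code):
# correlate the norm of each word's BERT contextualized embedding with its
# human-annotated importance score using Spearman's rank correlation.
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "the meeting starts at nine tomorrow morning"
importance = [0.1, 0.8, 0.6, 0.2, 0.9, 0.7, 0.5]  # hypothetical annotations

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0, 1:-1]  # drop [CLS]/[SEP]
norms = hidden.norm(dim=-1).tolist()

rho, p = spearmanr(norms, importance)
print(f"Spearman rho={rho:.3f}, p={p:.3f}")
```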

We are releasing a dataset containing videos of both fluent and non-fluent signers using American Sign Language (ASL), which were collected using a Kinect v2 sensor. This dataset was collected as part of a project to develop and evaluate computer vision algorithms to support new technologies for automatic detection of ASL fluency attributes. A total of 45 fluent and non-fluent participants were asked to perform signing homework assignments similar to those used in introductory or intermediate level ASL courses. The data is annotated to identify several aspects of signing, including grammatical features and non-manual markers. Sign language recognition is currently very data-driven, and this dataset can support the design of recognition technologies, especially technologies that can benefit ASL learners. This dataset may also be of interest to ASL education researchers who want to contrast fluent and non-fluent signing.

As they develop comprehension skills, American Sign Language (ASL) learners often view challenging ASL videos, which may contain unfamiliar signs. Current dictionary tools require students to isolate a single sign they do not understand and input a search query, either by selecting linguistic properties or by performing the sign into a webcam. Students may struggle with extracting and re-creating an unfamiliar sign, and they must leave the video-watching task to use an external dictionary tool. We investigate a technology that enables users, in the moment, i.e., while they are viewing a video, to select a span of one or more signs that they do not understand and view dictionary results. We interviewed 14 ASL learners about their challenges in understanding ASL video and their workarounds for unfamiliar vocabulary. We then conducted a comparative study and an in-depth analysis with 15 ASL learners to investigate the benefits of using video sub-spans for searching, and their interactions with a Wizard-of-Oz prototype during a video-comprehension task. Our findings revealed benefits of our tool in terms of the quality of video translations produced and the perceived workload to produce them. Our in-depth analysis also revealed benefits of an integrated search tool and of using span selection to constrain video play. These findings inform future designers of such systems, computer vision researchers working on the underlying sign matching technologies, and sign language educators.

Caption text conveys salient auditory information to deaf or hard-of-hearing (DHH) viewers. However, the emotional information within the speech is not captured. We developed three emotive captioning schemas that map the output of audio-based emotion detection models to expressive caption text that can convey underlying emotions. The three schemas used typographic changes to the text, color changes, or both. Next, we designed a Unity framework to implement these schemas and used it to generate stimuli videos. In an experimental evaluation with 28 DHH viewers, we compared their ability to understand emotions and their subjective judgments across the three captioning schemas. We found no significant difference in participants' ability to understand the emotion based on the captions, nor in their subjective preference ratings. Open-ended feedback revealed factors contributing to individual differences in preferences among the participants, as well as challenges with automatically generated emotive captions that motivate future work.
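As a toy illustration of such a schema (not the Unity framework from the study), the sketch below maps an emotion label from an audio classifier to caption styling using Unity-style rich-text tags. The emotion set, colors, and styling choices are assumptions.

```python
# A minimal sketch (illustrative emotion set and styles) of mapping an audio
# emotion classifier's label to styled caption text via Unity rich-text tags.
EMOTION_STYLES = {
    "happy":   {"color": "#F5C518", "bold": False, "italic": False},
    "sad":     {"color": "#4A78C2", "bold": False, "italic": True},
    "angry":   {"color": "#C0392B", "bold": True,  "italic": False},
    "neutral": {"color": "#FFFFFF", "bold": False, "italic": False},
}

def style_caption(text: str, emotion: str) -> str:
    s = EMOTION_STYLES.get(emotion, EMOTION_STYLES["neutral"])
    if s["bold"]:
        text = f"<b>{text}</b>"
    if s["italic"]:
        text = f"<i>{text}</i>"
    return f'<color={s["color"]}>{text}</color>'   # Unity rich-text markup

print(style_caption("I can't believe it!", "angry"))
```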

Despite the recent improvements in automatic speech recognition (ASR) systems, their accuracy is imperfect in live conversational settings. Classifying the importance of each word in a caption transcription can enable evaluation metrics that best reflect Deaf and Hard of Hearing (DHH) readers' judgment of caption quality. Prior work has proposed using word embeddings, e.g., word2vec or BERT embeddings, to model word importance in conversational transcripts. Recent work also disseminated a human-annotated word importance dataset. We conducted a word-token-level analysis of this dataset and explored its Part-of-Speech (POS) distribution. We then augmented the dataset with POS tags and reduced the class imbalance by generating 5% additional text using masking. Finally, we investigated how various supervised models learn the importance of words. The best-performing model trained on our augmented dataset performed better than prior models. Our findings can inform the design of a metric for measuring live caption quality from DHH users' perspectives.
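The masking-based generation step might look roughly like the sketch below, which masks one token and lets a BERT fill-mask model propose sentence variants. The example sentence and the choice of masked token are illustrative, and the paper's full pipeline (including its use of POS tags) is not reproduced here.

```python
# A minimal sketch (not the paper's pipeline) of masking-based augmentation:
# create extra sentence variants for under-represented importance classes by
# masking a token and letting BERT propose substitutes.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

tokens = ["please", "send", "the", "report", "before", "the", "meeting"]
masked = tokens.copy()
masked[3] = fill_mask.tokenizer.mask_token   # mask "report" and rewrite it

candidates = fill_mask(" ".join(masked), top_k=3)
for c in candidates:
    print(c["sequence"])   # sentence variants with a substituted word
```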

Captions blocking visual information in live television news leads to dissatisfaction among Deaf and Hard of Hearing (DHH) viewers, who cannot see important information on the screen. Prior work has proposed generic guidelines for caption placement but not specifically for live television news, an important genre of television with dense placement of onscreen information regions, e.g., the current news topic, scrolling news, etc. To understand DHH viewers' gaze behavior while watching television news, both spatially and temporally, we conducted an eye-tracking study with 19 DHH participants. Participants' gaze behavior varied over time, as measured by their proportional fixation time on information regions on the screen. An analysis of gaze behavior coupled with open-ended feedback revealed four thematic categories of information regions. Our work motivates considering the time dimension when placing captions, to avoid blocking information regions whose importance varies over time.
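For context, proportional fixation time can be computed as in the short sketch below: each region's summed fixation duration divided by the total fixation duration. The region labels and durations are invented for illustration.

```python
# A minimal sketch (hypothetical data) of proportional fixation time: the
# share of total fixation duration spent on each on-screen region.
from collections import defaultdict

# Each fixation: (region label, duration in milliseconds).
fixations = [
    ("captions", 420), ("anchor", 310), ("scrolling_news", 150),
    ("captions", 500), ("topic_banner", 120), ("anchor", 260),
]

totals = defaultdict(int)
for region, duration in fixations:
    totals[region] += duration
grand_total = sum(totals.values())

for region, duration in totals.items():
    print(f"{region}: {duration / grand_total:.2%}")
```
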
Searching for unfamiliar American Sign Language (ASL) words in a dictionary is challenging for learners, as it involves recalling signs from memory and providing specific linguistic details. Fortunately, the emergence of sign-recognition technology will soon enable users to search by submitting a video of themselves performing the word. Although previous research has independently addressed algorithmic enhancements and design aspects of ASL dictionaries, there has been limited effort to integrate both. This paper presents the design of an end-to-end sign language dictionary system, incorporating design recommendations from recent human–computer interaction (HCI) research. Additionally, we share preliminary findings from an interview-based user study with four ASL learners.

PopSign is a smartphone-based bubble-shooter game that helps hearing parents of deaf infants learn sign language. To help parents practice their ability to sign, PopSign is integrating sign language recognition as part of its gameplay. To train the recognizer, we introduce the PopSign ASL v1.0 dataset, which collects examples of 250 isolated American Sign Language (ASL) signs using Pixel 4A smartphone selfie cameras in a variety of environments. It is the largest publicly available isolated-sign dataset by number of examples and is the first dataset to focus on one-handed, smartphone signs. We collected over 210,000 examples at 1944x2592 resolution made by 47 consenting Deaf adult signers for whom American Sign Language is their primary language. We manually reviewed 217,866 of these examples, of which 175,022 (approximately 700 per sign) were the sign intended for the educational game. 39,304 examples were recognizable as a sign but were not the desired variant or were a different sign. We provide a training set of 31 signers, a validation set of eight signers, and a test set of eight signers. A baseline LSTM model for the 250-sign vocabulary achieves 82.1% accuracy (81.9% class-weighted F1 score) on the validation set and 84.2% (83.9% class-weighted F1 score) on the test set. Gameplay suggests that this accuracy will be sufficient for creating educational games involving sign language recognition.
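A baseline of this kind might resemble the Keras sketch below: an LSTM over per-frame feature vectors followed by a 250-way softmax. The sequence length, feature size, and layer sizes are assumptions, not the paper's exact architecture or input representation.

```python
# A minimal sketch (assumed architecture and feature sizes, not the paper's
# baseline) of an LSTM classifier over per-frame features for 250 signs.
import tensorflow as tf

NUM_SIGNS = 250           # vocabulary size from the dataset
MAX_FRAMES = 96           # assumed padded sequence length
FEATURES_PER_FRAME = 128  # assumed size of per-frame hand/pose features

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_FRAMES, FEATURES_PER_FRAME)),
    tf.keras.layers.Masking(mask_value=0.0),   # ignore zero-padded frames
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_SIGNS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```
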
articles by Saad Hassan

Advances in sign-language recognition technology have enabled researchers to investigate various methods that can assist users in searching for an unfamiliar sign in ASL using sign-recognition technology. Users can generate a query by submitting a video of themselves performing the sign they believe they encountered somewhere and obtain a list of possible matches. However, there is disagreement among developers of such technology on how to report the performance of their systems, and prior research has not examined the relationship between the performance of search technology and users’ subjective judgements for this task. We conducted three studies using a Wizard-of-Oz prototype of a webcam-based ASL dictionary search system to investigate the relationship between the performance of such a system and user judgements. We found that, in addition to the position of the desired word in a list of results, the placement of the desired word above or below the fold and the similarity of the other words in the results list affected users’ judgements of the system. We also found that metrics that incorporate the precision of the overall list correlated better with users’ judgements than did metrics currently reported in prior ASL dictionary research.
Soon, smartphones may be capable of supporting American Sign Language (ASL) signing and/or fingerspelling for text entry. To explore the usefulness of this approach, we compared emulated fingerspelling recognition with a virtual keyboard for 12 Deaf participants. With practice, fingerspelling is faster (42.5 wpm), potentially has fewer errors (4.02% corrected error rate) and higher throughput (14.2 bits/second), and is as desired as virtual keyboard texting (31.9 wpm; 6.46% corrected error rate; 10.9 bits/second throughput). Our second study recruited another 12 Deaf users at the 2022 National Association of the Deaf conference to compare the walk-up usability of fingerspelling alone, signing, and virtual keyboard text entry for interacting with an emulated mobile assistant. Both signing and virtual keyboard text entry were preferred over fingerspelling.
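For reference, two of the measures reported above can be computed as in the sketch below, following common text-entry conventions (words per minute and corrected error rate). The inputs are illustrative; the study's actual analysis code is not shown.

```python
# A minimal sketch (standard text-entry formulas, illustrative inputs) of
# words per minute and corrected error rate.
def words_per_minute(transcribed: str, seconds: float) -> float:
    # Conventional definition: one "word" is five characters; the first
    # character is not counted because timing starts on the first keystroke.
    return (len(transcribed) - 1) / seconds * (60.0 / 5.0)

def corrected_error_rate(incorrect_fixed: int, correct: int,
                         incorrect_not_fixed: int) -> float:
    # Per Soukoreff & MacKenzie: IF / (C + INF + IF).
    return incorrect_fixed / (correct + incorrect_not_fixed + incorrect_fixed)

print(f"{words_per_minute('hello world meet me at noon', 8.0):.1f} wpm")
print(f"{corrected_error_rate(3, 90, 2):.2%} corrected error rate")
```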