SCOPE: Sign Language Contextual Processing with Embedding from LLMs
Yuqi Liu* , Wenqian Zhang* , Sihan Ren, Chengyu Huang, Jingyi Yu, Lan Xu†
School of Information Science and Technology, ShanghaiTech University
{liuyq2, zhangwq2022, rensh2022, huangchy, yujingyi, xulan1}@[Link]
Figure 1: (a) Our SCOPE dataset contains rich contextual information and sign language videos. (b) Our SCOPE framework
is a robust context-aware sign language recognition/translation model capable of recognizing dialogue-based sign language
gestures, predicting glosses, and generating spoken sentences with the aid of LLMs.
Abstract
Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign language Contextual Processing with Embedding from LLMs), a novel context-aware vision-based SLR and SLT framework. For SLR, we utilize dialogue contexts through a multi-modal encoder to enhance gloss-level recognition. For subsequent SLT, we further fine-tune a Large Language Model (LLM) by incorporating prior conversational context. We also contribute a new sign language dataset that contains 72 hours of Chinese sign language videos in contextual dialogues across various scenarios. Experimental results demonstrate that our SCOPE framework achieves state-of-the-art performance on multiple datasets, including Phoenix-2014T, CSL-Daily, and our SCOPE dataset. Moreover, surveys conducted with participants from the Deaf community further validate the robustness and effectiveness of our approach in real-world applications. Both our dataset and code will be open-sourced to facilitate further research.

* The authors contributed equally.
† Corresponding author.

Introduction
Sign language is the vital visual language used by the Deaf and hard of hearing. Hence, vision-based sign language understanding provides a communication bridge between the Deaf and hearing communities. Such a bridge should accurately and conveniently convey complex contextual information during communication, especially in dialogue scenarios.

Currently, the two main tasks in vision-based sign language processing are Sign Language Recognition (SLR) (Jiao et al. 2023; Wei and Chen 2023; Zheng et al. 2023) and Sign Language Translation (SLT) (Zhao et al. 2024; Chen et al. 2022b; Yin et al. 2023). SLR converts visual signals into intermediate gloss sequences (Stokoe Jr 2005), while SLT translates visual signals or glosses into natural language. Yet, we notice that most existing
methods, both SLR and SLT, focus on translating one sentence at a time, largely ignoring the contextual information of dialogue scenes. This is mainly due to the severe lack of sign language datasets for dialogue scenes with sufficient contextual information. For example, the widely adopted PHOENIX2014 (Koller, Forster, and Ney 2015) and PHOENIX2014T (Camgoz et al. 2018) datasets focus on weather forecasts. The How2Sign dataset (Duarte et al. 2021) addresses everyday scenarios but only contains isolated sign language sentences. The CSL-Daily dataset (Zhou et al. 2021a) contains daily sentences, but they lack preceding or following context and are essentially still independent statements. On the other hand, in the field of Natural Language Processing, recent advances (Ouyang et al. 2022; Touvron et al. 2023; qwe 2024) with large language models (LLMs) have demonstrated that contextual information can significantly improve semantic understanding and linguistic abilities. Some recent methods (Gong et al. 2024; Wong, Camgöz, and Bowden 2024) utilize LLMs for sign language understanding. Yet, they still focus on per-sentence translation and fall short of analyzing the contextual information of dialogue scenarios. In a nutshell, both the datasets and the methodology for contextual vision-based sign language processing remain lacking.

To this end, we introduce SCOPE, a contextual sign language recognition and translation approach tailored for dialogue scenes; see Fig. 1 for an overview. Specifically, we first contribute a context-based dataset of Chinese sign language dialogues. Our dataset covers a wide range of both daily and professional conversations, such as shopping and medical treatment. It includes 59,231 dialogue sequences totaling 72.4 hours. For each sequence, we provide video footage, gloss annotations, and dialogue texts, all produced by professional Deaf individuals from diverse backgrounds.

Secondly, we provide a strong baseline for vision-based contextual sign language processing, which organically utilizes recent LLMs to extract the contextual information from our unique dataset. For the SLR task, we extract sign motion features from the video footage and then introduce a novel embedding alignment module to align them to the context embeddings from a frozen LLM. Then, we feed these aligned motion/context embeddings into a gloss encoder to obtain the recognized gloss sequences. We observe that such an alignment between the current motions and the preceding contextual information from the LLM is crucial for the performance gain. It preserves both the motion and semantic information of the sign language while enabling the concatenation of the contextual embeddings with the input. For the subsequent SLT task, we further leverage the contextual understanding capabilities of the LLM. We use the gloss output from the previous SLR module and the contextual text as inputs and adopt Q-LoRA (Dettmers et al. 2023) to efficiently fine-tune a pretrained LLM, achieving accurate and natural translations that are closely aligned with the context.

For validation, we conduct comprehensive experiments on both our unique contextual dataset and existing context-free datasets and showcase a companion live demo for sign language translation, which demonstrates the state-of-the-art performance of our approach. In summary, we provide a novel vision-based, context-driven sign language processing approach that utilizes LLMs to address SLR and SLT tasks in dialogue and communication settings. We also contribute a large-scale contextual dataset of Chinese sign dialogues. We believe that both our dataset and baseline approach are the first of their kind and open up a research direction towards context-aware, vision-based sign language analysis. Both our benchmark dataset and baseline approach will be made publicly available.

Related Works
Sign Language Recognition (SLR) focuses on recognizing glosses from sign videos. While progress has been made in Isolated SLR (ISLR) (Albanie et al. 2020; Tunga, Nuthalapati, and Wachs 2021; Li et al. 2020c; Hu et al. 2021; Li et al. 2020a), current research is shifting to Continuous SLR (CSLR), which converts continuous sign videos into sentence-level gloss sequences. This task involves two main components: feature extraction and translating these features into gloss sequences.

Visual feature extraction often involves extracting features from RGB images using CNNs (Chen et al. 2022a; Li et al. 2020b; Hu et al. 2023c; Min et al. 2021). These features are then modeled with temporal frameworks like RNNs (Camgoz et al. 2018; Ko et al. 2019), LSTMs (Hu et al. 2023a; Cui, Liu, and Zhang 2019), and Transformers (Camgoz et al. 2020; Voskou et al. 2021; Yin and Read 2020) to capture the connection between visual signals and glosses. Some approaches (Zhou et al. 2021b; Papadimitriou and Potamianos 2020) utilize estimated keypoint sequences to describe motions through spatial coordinates or generate heatmaps (Chen et al. 2022b, 2024). However, many methods require video processing, which can be slow and space-consuming, limiting their practical application.

Decoding the extracted features into gloss sequences requires temporal modeling. Hidden Markov Models (HMMs) (Koller, Zargaran, and Ney 2017a; Gao et al. 2004; Koller et al. 2016) and Connectionist Temporal Classification (CTC) (Cheng et al. 2020; Zhou et al. 2021b; Min et al. 2021) are commonly used for this purpose. However, most existing methods focus on frame-wise or sentence-wise information, often neglecting the broader linguistic context, resulting in the loss of important language features.

Sign Language Translation (SLT) aims to translate sign language directly into natural language, bridging the gap between the Deaf community and hearing individuals. This task is challenging due to the modality gap between visual signals and text, compounded by the scarcity of contextual sign language datasets. Many approaches (Camgoz et al. 2020; Zhou et al. 2021b,a) use SLR results to aid translation. Joint training of SLR and SLT modules has also been explored to improve performance. Some researchers (Li et al. 2020b; Camgoz et al. 2018; Zhou et al. 2023) seek to eliminate glosses by directly translating sign language videos into text using techniques like Conditional Variational Autoencoders and Transformers. SLT involves projecting visual features into coherent textual representations, necessitating insights from both computer vision and natural language processing.
Figure 2: Overview of SCOPE framework. Our Embedding Alignment Encoder captures holistic linguistic information from
the whole motion sequence. Aligning embedding space to match a frozen LLM enables integrating previous context information
for SLR. Finally, Q-LoRA fine-tuning fits an LLM for translating predicted glosses with context into spoken language.
Key advancements leverage pretrained language models like mBART (Liu et al. 2020) for enhanced textual understanding (Chen et al. 2022b,a). Recent studies also explore the use of frozen and fine-tuned large language models (Wong, Camgöz, and Bowden 2024; Gong et al. 2024) to improve translation quality.

Sign Language Dataset. Progress in sign language research has been driven by data. Many researchers have contributed valuable datasets of isolated signs (Wang et al. 2016; Zhang et al. 2016; Joze and Koller 2018; Imashev et al. 2020; Sridhar et al. 2020; Li et al. 2020a; Sincan and Keles 2020; Albanie et al. 2020; Desai et al. 2024). However, since each video clip corresponds to a single sign, the practical utility of such data remains limited.

Several recent datasets provide continuous sign language data. For instance, the PHOENIX-2014 (Koller, Forster, and Ney 2015) dataset includes sign language videos from television broadcasts along with corresponding gloss annotations, primarily focusing on weather forecasts. Datasets like SIGNUM (von Agris, Knorr, and Kraiss 2008), PHOENIX-2014T (Camgoz et al. 2018), and CSL-Daily (Zhou et al. 2021a) not only offer gloss annotations but also include natural language translations of the signs, thereby advancing Sign Language Translation (SLT) research. Additionally, the CCSL (Huang et al. 2018) dataset provides images with depth information, enriching the sign data. The How2Sign (Duarte et al. 2021) dataset stands out with its multi-view information, enabling the capture of 3D sign language motions.

Despite improvements in the size and diversity of sign language datasets, they remain limited in domain coverage. Current corpora consist of context-independent sentences, lacking the contextual relationships needed to fully utilize the linguistic features of sign language, which hinders advancements in SLT research.

Method
We present SCOPE, a novel framework that aligns motion features with LLM-provided sentence embeddings of previous contexts, aiming to fully utilize the contextually related dialogues in which sign language conversations mainly occur. To address the often overlooked contextual aspects in data collection, we provide the SCOPE dataset, which annotates sign videos with additional context information that our model effectively utilizes. Details of the SCOPE dataset are presented in the Dataset section.

Fig. 2 shows the structure of our SCOPE framework. Our Embedding Alignment Encoder transforms motion features into an embedding that captures the linguistic information of the whole motion sequence. Aligning this embedding space to a frozen LLM enables integrating contextual information from previous sentences to recognize glosses. Finally, Q-LoRA fine-tuning fits an LLM for translating the predicted glosses into spoken language with the assistance of context information.

Model Details
Embedding Alignment Encoder. We use a transformer encoder structure to extract information from motion features. The input keypoints J = J_1 ... J_t first pass through a feature-extractor linear layer and a temporal-sequencer linear layer, which compress the motion information in the spatial and temporal dimensions, respectively, resulting in the intermediate motion input D, which matches the textual embedding in shape.

Next, we pretrain an Embedding Alignment Encoder to align features from the motion space with the textual embedding space. The key idea is to directly learn the alignment between the linguistic features of sign language motion and the contextual features of text. In this step, we aim to align the sign motion feature D with the embedding vector of the target sentence. We do this by passing the motion features through the Embedding Alignment transformer encoder and then pooling them to compress the time dimension, resulting in an embedding vector that matches the size of the text embedding. The encoding process first embeds the input D into a latent space, represented by h_0, and then obtains the encoded hidden states h_n through N attention layers. Finally, a feed-forward network is used to obtain the encoded vector. The formulas for the transformer motion encoder are as follows:

Q = W^Q h_i,  K = W^K h_i,  V = W^V h_i,
h_{i+1} = Attn(Q, K, V) = softmax(Q K^T / \sqrt{C}) V,    (1)

where W^Q, W^K, and W^V are trainable weights, C is the number of channels in the attention layer, and h_{i+1} is the hidden state before the next layer.

Supervision comes from the distance to the embedding of the target sentence provided by an LLM. The loss of the motion encoder is the L2 distance between the pooled embedding vector and the target text embedding vector:

L_emb = E || E_out − E_text ||_2,    (2)

where L_emb denotes the embedding loss, E_out is the output of the motion encoder, and E_text is the text embedding vector. The text embedding vector is generated by a frozen LLM text embedding model (Neelakantan et al. 2022), which encodes the ground-truth sentence meaning of the sign video. Through this process, we align the motion features with the language features, enhancing the connections between the isolated sign words.
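To make the alignment step concrete, the following PyTorch sketch shows one possible form of the Embedding Alignment Encoder and its loss (Eq. 1–2). The module name, mean pooling, and dimensions are illustrative assumptions that follow the implementation details reported later (8 heads, 2 layers, hidden size 1568), not the authors' released code.

```python
# Minimal sketch of the embedding-alignment pretraining step (Eq. 1-2), assuming
# PyTorch; module names, pooling choice, and dimensions are illustrative guesses.
import torch
import torch.nn as nn

class EmbeddingAlignmentEncoder(nn.Module):
    def __init__(self, motion_dim=1568, text_dim=1536, n_heads=8, n_layers=2, ffn_dim=3136):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=motion_dim, nhead=n_heads, dim_feedforward=ffn_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(motion_dim, text_dim)  # map to the LLM text-embedding size

    def forward(self, motion_feats):          # motion_feats: (B, T, motion_dim)
        hidden = self.encoder(motion_feats)   # attention stack of Eq. (1)
        pooled = hidden.mean(dim=1)           # pool over time to one vector per clip
        return self.proj(pooled)              # (B, text_dim)

def alignment_loss(pred_emb, target_text_emb):
    # L2 distance to the frozen LLM sentence embedding of the target sentence (Eq. 2)
    return torch.linalg.norm(pred_emb - target_text_emb, dim=-1).mean()

# usage: motion features D from the feature extractor, targets from a frozen text model
enc = EmbeddingAlignmentEncoder()
D = torch.randn(4, 120, 1568)     # batch of 4 clips, 120 temporal steps
target = torch.randn(4, 1536)     # e.g., 1536-dim text-embedding-ada-002 outputs
loss = alignment_loss(enc(D), target)
loss.backward()
```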
Gloss Embedding Encoder. After aligning the motion features, we obtain an embedding vector that contains both semantic and sign language information. Next, we combine this with the motion features to predict the gloss. For sign language conversations, providing the previous language context is crucial for improving the accuracy of recognizing the current target sentence. Therefore, we use a frozen LLM to obtain the embedding vectors of the previous sentences. To minimize irrelevant information, we only keep the last three text embeddings; if there are fewer than three previous sentences, we use a mask to ignore the padded input. The gloss embedding encoder is also a transformer encoder. The encoding process can be formulated as follows:

H^0_{t,A} = Hidden(Cat(E_t, E_A)),
E_out = FFN(Attn(W^Q H^N_{t,A}, W^K H^N_{t,A}, W^V H^N_{t,A})),    (3)

where Hidden is the hidden-layer embedding in the transformer encoder and Cat is the concatenation operation. E_t is the sequence encoded in the previous stage, and E_A is the context text embedding. FFN is the feed-forward network in the transformer encoder. Passing the output E_out through a linear classifier layer, we obtain the output logits of the glosses.

CTC Decoding. We use connectionist temporal classification (CTC) (Graves et al. 2006) loss to optimize the embedding encoder:

L^y_CTC = −log p(l | y) = −log \sum_{π ∈ B^{−1}(l)} p(π | y),    (4)

where l = l_1 ... l_t is the gloss sequence corresponding to the keypoint sequence J, B is a many-to-one mapping between hypotheses and glosses, and π is an alignment path. In addition, we adopt the Minimum Word Error Rate (MWER) training technique (Meng et al. 2021) to reduce the mismatch between training objectives and evaluation metrics, boosting the accuracy of hypotheses at the top of the beam. We use beam search during training to decode the top 3 possible gloss sequences. While keeping the top-1 decoded result as the final output of the SLR network, the other candidate glosses contribute to optimization through the minimum word error rate (MWER) loss:

L_MWER = \sum_{n=1}^{N} \bar{P}(Y_n | J; θ) R(Y_n, Y*),    (5)

where \bar{P}(Y_n | J; θ) = P(Y_n | J; θ) / \sum_{n=1}^{N} P(Y_n | J; θ) is the re-normalized posterior over the N-best hypotheses, θ denotes the model parameters, and R(Y_n, Y*) is the number of word errors in hypothesis Y_n compared to the reference Y*. Furthermore, the top 3 decoded results also serve as the input to the LLM in the SLT task.
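The two SLR losses can be written down compactly. The sketch below shows the loss arithmetic only, assuming PyTorch; the beam-search decoder and the word-error counter R(·,·) are taken as given, and this is not the authors' exact training code.

```python
# Hedged sketch of the SLR training losses (Eq. 4-5).
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def slr_ctc_loss(gloss_logits, targets, input_lens, target_lens):
    # gloss_logits: (T, B, num_gloss); CTC marginalizes over all alignments (Eq. 4)
    log_probs = F.log_softmax(gloss_logits, dim=-1)
    return ctc_loss(log_probs, targets, input_lens, target_lens)

def mwer_loss(hyp_logprobs, word_errors):
    # hyp_logprobs: (B, N) sequence log-probabilities of the N-best hypotheses
    # word_errors:  (B, N) word errors R(Y_n, Y*) against the reference
    posterior = F.softmax(hyp_logprobs, dim=-1)          # re-normalized over the beam
    return (posterior * word_errors).sum(dim=-1).mean()  # expected word errors (Eq. 5)

# usage with dummy shapes: 3-best hypotheses per sample
hyp_lp = torch.randn(4, 3)
errs = torch.tensor([[0.0, 2.0, 3.0]] * 4)
loss = mwer_loss(hyp_lp, errs)
```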
Contextual LLM Fine-tuning. Inspired by (Gong et al. 2024), we use Q-LoRA to fine-tune an LLM as a sign language translator; we adopt the Qwen2 LLM as our translator. To fine-tune Qwen2, we set the LLM's role in the scenario as a "sign language translator" and design prompts to guide the model. In the prompts, we provide the top 3 gloss sequences mentioned above and all preceding text related to the current sign language sequence, and ask the LLM to summarize the top 3 glosses and infer the correct words by checking the previous texts. We also provide some summarized task examples to help the LLM understand the translation procedure. We use the previous top-3 gloss outputs as input and the designed prompt along with the preceding text as auxiliary input, jointly fine-tuning the LLM. We optimize the model using the cross-entropy loss:

L_llm = − \sum_i \hat{Y}_t(i) log(Y_t(i)),  i ∈ N_tok,    (6)

where \hat{Y}_t is the ground-truth textual output, Y_t is the predicted textual output, and N_tok is the number of classes in the tokenizer.
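A possible realization of this stage is sketched below with the Hugging Face transformers and peft libraries: a 4-bit quantized Qwen2 chat model with LoRA adapters and a context-plus-glosses prompt. The checkpoint name, prompt wording, and LoRA hyperparameters are assumptions for illustration, not the paper's released configuration.

```python
# Hedged sketch of contextual SLT fine-tuning with Q-LoRA on a Qwen2 chat model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2-7B-Instruct"  # assumed checkpoint
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb,
                                             device_map="auto")
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

def build_prompt(top3_glosses, context_sentences):
    # Role instruction + last dialogue turns + the 3-best gloss hypotheses from SLR.
    context = "\n".join(context_sentences[-3:])
    glosses = "\n".join(f"{i + 1}. {' '.join(g)}" for i, g in enumerate(top3_glosses))
    return ("You are a sign language translator. Given the dialogue so far and the "
            "candidate gloss sequences, write the most likely spoken sentence.\n"
            f"Dialogue so far:\n{context}\nCandidate glosses:\n{glosses}\nTranslation:")

# Fine-tuning then minimizes the token-level cross-entropy of Eq. (6) on the
# reference sentences, e.g. with a standard causal-LM trainer over
# (build_prompt(...), reference) pairs.
```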
Figure 3: SCOPE dataset collection pipeline. Given dialogue texts, experienced signers produce corresponding sign videos
along with self-annotated glosses. For each video, other signers replicate data based on the glosses and the text.
Dataset Language Videos Duration(h) Signers Vocab Gloss Text Dialogue Source
PHOENIX-2014 DGS 6,841 8.9 9 1k (German) ✓ × × TV
PHOENIX-2014T DGS 8,257 11 9 3k (German) ✓ ✓ × TV
CSL-Daily CSL 20,654 22.8 10 2k (Chinese) ✓ ✓ × Lab
How2Sign ASL 35,191 79 11 16k (English) ✓ ✓ × Lab
Ours CSL 59,231 72.4 12 5k (Chinese) ✓ ✓ ✓ Lab
Table 1: Dataset comparisons. Key statistics of widely used sign language datasets for comparison. Our dataset is currently
the largest dataset in CSL (Chinese Sign Language) that contains dialogue context information.
Data Processing
Iris Normalization. To obtain keypoint sequences, we utilize DWPose (Yang et al. 2023) to extract whole-body keypoints (COCO-WholeBody (Jin et al. 2020)) from the sign language videos. Each keyframe contains 133 keypoints, J = {J_{1,i}, ..., J_{T,i} | i = 1...133}. However, such keypoints are often influenced by the input video resolution and the distance between the person and the camera, so a scaling process is needed to mitigate the impact of input distortions on motion. Inspired by (Lugaresi et al. 2019), we choose the length of the lower eyelid as the golden standard, compare the eyelid length differences to obtain a scale factor, and scale motions to a standard size:

J^scaled_t = (J_x, J_y) / |J_x^{63} − J_x^{64}|,    (7)

where J^scaled_t are the scaled joints at frame t, (J_x, J_y) are the 2D coordinates of the joints, J_x^{63} − J_x^{64} is the eyelid length, and 63 and 64 are the indexes of the left and right corners of the eyelid in COCO-WholeBody.

Data Centralizing. After that, we follow (Jiao et al. 2023) by selecting 77 keypoints and dividing them into 5 groups, then apply group-specific centralization to decouple multi-grained motion information in the skeleton data:

J_{t,k} = J_{t,k} − J_{t,r_g},  k ∈ G,    (8)

where J_{t,k} denotes the joints of group k at frame t, G is the set of 5 groups, and r_g is the root keypoint of group g.

Data Standardizing. Finally, we standardize all input motions so that their distribution conforms more closely to a standard distribution, which eliminates the difficulties that motion corner cases bring to training:

J^std_i = N( J_i − \sum_n \sum_t J_i / (N × T),  I / J^std_i ),  i = 1...77,    (9)

where J_i denotes the i-th joint and J^std_i is the standard deviation of joint i.
Where JtScaled are scaled joints under frame t, (Jx , Jy ) are dataset comparison can be found in Tab.1.
2D coordinates of joints. Jx63 − Jx64 is eyelid length; 63 and Data Collection. Our dataset primarily focuses on daily
64 are indexes of left and right wings of the eyelid in COCO- conversations within the Deaf community, as well as dia-
WholeBody. logues involving specialized terminology in more profes-
Data Centralizing. After that, we followed (Jiao et al. 2023) sional settings(Bragg et al. 2019). Our dataset includes daily
by selecting 77 keypoints and dividing them into 5 groups, subjects such as school experiences, shopping, and social in-
then applied group-specific centralization to decouple multi- teractions. Glosses encompass specific products and brands,
grained motion information in skeleton data: titles of audiovisual works, and other relevant terms. For
more details, please refer to the supplementary material.
Jt,k = Jt,k − Jt,rg , k ∈ G, (8) Data collection is carried out by a team whose primary
where Jt,k denotes joints under the t frame, group k, G are members are several professional Deaf signers and three
5 groups, and rg is the root keypoint of group g. sign language linguistics experts. The team also includes
Data Standardizing. Finally, we standardize all input mo- a diverse group of Deaf individuals across various ages,
tions to make their distribution more closely conform to a genders, occupations, and educational backgrounds to cap-
diverse signing styles. To ensure a natural dialogue environment, each sentence was derived from conversations recorded in real situations.

Fig. 3 illustrates our data collection pipeline. Professional Deaf signers receive reference sentences and record the corresponding sign language videos. Capable of self-annotating the recorded motion into glosses, they produce gloss annotations that are distributed to the other signers. With the sentences and glosses as references, the other signers replicate the data with diverse signing habits and styles. We ensure that four different signers record videos for each piece of text, at a resolution of 640x480 and a frame rate of 30 frames per second.

Annotation Cleaning and Validation. Self-annotated glosses still suffer from inconsistencies across different annotators: the same sign is sometimes annotated with synonyms, while a sequence of signs may be interpreted as a phrase or as separate words. To mitigate such issues, we follow CSL-Daily (Zhou et al. 2021a) and apply a multi-round data cleaning process with our SCOPE SLR model. In particular, we compute the Minimum Edit Distance (MED) between predicted glosses and the ground-truth annotations. The results enable us to identify patterns of synonyms, word division, and word combination. Our sign language linguistics experts then examine frequently confused patterns and correct misannotated data. We iterate this process to reduce our gloss vocabulary size from over 7k to 5k, significantly improving the dataset's quality.
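The MED-based cleaning relies on aligning predicted and annotated gloss sequences and counting which glosses are repeatedly substituted for each other, so that experts can review likely synonyms. The sketch below shows one way such confusion mining could be implemented; the helper names are ours and this is not the paper's released tooling.

```python
# Hedged sketch of MED-based gloss confusion mining for annotation cleaning.
from collections import Counter

def align_ops(ref, hyp):
    """Levenshtein alignment returning (op, ref_token, hyp_token) tuples,
    with op in {C: correct, S: substitution, D: deletion, I: insertion}."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("C" if ref[i - 1] == hyp[j - 1] else "S", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("D", ref[i - 1], None)); i -= 1
        else:
            ops.append(("I", None, hyp[j - 1])); j -= 1
    return ops[::-1]

def confusion_pairs(sequence_pairs):
    """Count substituted gloss pairs across (reference, prediction) pairs."""
    counts = Counter()
    for ref, hyp in sequence_pairs:
        for op, r, h in align_ops(ref, hyp):
            if op == "S":
                counts[(r, h)] += 1
    return counts.most_common()
```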
Experiments
Experimental Setup
Datasets and Evaluation Metrics. For the SLR task, we evaluate our proposed method on PHOENIX14, PHOENIX14-T, CSL-Daily, and the SCOPE dataset; the latter three datasets are also used in the SLT experiments. For the number of samples and other details of the datasets, please refer to the supplementary materials. The train/dev/test splits of the existing datasets are maintained. For our SCOPE dataset, we follow (Zhang et al. 2024) and use widely adopted split ratios to randomly split the dataset by 80%, 5%, and 15% into train, dev, and test sets, carefully ensuring that no sentence appears in more than one set and that no sentence in the dev or test set appears in the context dialogues of the training set.

Following previous works (Chen et al. 2022b), we adopt the Word Error Rate (WER), which measures the percentage of incorrect words in the recognized text, for SLR, and BLEU (Papineni et al. 2002) and ROUGE-L (Lin 2004), which assess the quality of translations based on n-gram overlap and longest common subsequences, for SLT. A lower WER indicates more accurate recognition, while higher BLEU and ROUGE-L signify better translations.
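For reference, the sketch below scores SLT outputs with BLEU and ROUGE-L using the sacrebleu and rouge-score packages as convenient stand-ins for the paper's own evaluation scripts, which are not reproduced here. For Chinese outputs, sacrebleu's "zh" tokenizer (or another character-level tokenization) would be the appropriate choice.

```python
# Hedged evaluation sketch for the SLT metrics (BLEU, ROUGE-L).
import sacrebleu
from rouge_score import rouge_scorer

def evaluate_slt(hypotheses, references):
    # hypotheses: predicted sentences; references: ground-truth sentences (parallel lists)
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_l = sum(scorer.score(ref, hyp)["rougeL"].fmeasure
                  for ref, hyp in zip(references, hypotheses)) / len(references)
    return {"BLEU-4": bleu.score, "ROUGE-L": 100 * rouge_l}

print(evaluate_slt(["the doctor will see you soon"], ["the doctor will see you shortly"]))
```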
Implementation Details. We obtain sentence embeddings from OpenAI's text-embedding-ada-002 model (Neelakantan et al. 2022). Body 2D keypoints are collected from the videos using DWPose (Yang et al. 2023). Our motion feature extractor block consists of a 2-layer MLP with a temporal Conv1D layer. The embedding alignment encoder and the gloss encoder are both 8-head transformer encoders with 2 and 4 layers, respectively, with hidden size 1568 and feed-forward size 3136. We adopt the AdamW optimizer with cosine annealing schedules, training for 20 epochs focused on alignment embedding and 60 epochs for gloss encoder training while keeping the previous module frozen. When training without the context module, we provide no context information, filling the context embeddings with zeros and giving an empty context input to the LLM. All experiments are executed on 8 NVIDIA A800 GPUs. More implementation details are provided in the supplementary materials.
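The frozen sentence embeddings used as alignment targets and dialogue context can be fetched once and cached. The sketch below uses the OpenAI Python SDK's embeddings endpoint; the caching helper and file layout are our additions, and the example sentences are placeholders.

```python
# Hedged sketch of retrieving and caching frozen text embeddings.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_sentences(sentences, cache_path="embeddings.json"):
    cache = json.loads(Path(cache_path).read_text()) if Path(cache_path).exists() else {}
    missing = [s for s in sentences if s not in cache]
    if missing:
        resp = client.embeddings.create(model="text-embedding-ada-002", input=missing)
        for sent, item in zip(missing, resp.data):
            cache[sent] = item.embedding  # 1536-dim vector
        Path(cache_path).write_text(json.dumps(cache, ensure_ascii=False))
    return [cache[s] for s in sentences]

# usage: embeddings of the last three dialogue sentences as context for one sample
context_vecs = embed_sentences(["Where are you going today?",
                                "I am going to the dentist.",
                                "How long has the tooth hurt?"])
```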
Comparison with State-of-the-art Methods
Sign Language Recognition (SLR). We evaluate our approach by comparing results on multiple datasets with the recent methods SEN-CSLR (Hu et al. 2023c), TwoStream-SLR (Chen et al. 2022b), and CorrNet (Hu et al. 2023b). On our SCOPE dataset, we evaluate their performance by training their open-sourced frameworks; we perform preprocessing to match the input specifications of each method and train their models adhering as closely as possible to their proposed training setups.

As shown in Tab. 2, our context-free SCOPE outperforms the other SLR methods in WER by 2.7%/2.2% on the CSL-Daily dev/test sets and 3.3%/3.1% on the SCOPE dataset, respectively. Moreover, adding context information further improves our model's recognition accuracy by 2.2%/3.3% WER, revealing that contextual understanding effectively assists gloss recognition.

Method | Phoenix-2014 (Dev/Test) | Phoenix-2014T (Dev/Test) | CSL-Daily (Dev/Test) | SCOPE (Dev/Test)
SEN-CSLR (Hu et al. 2023c) | 19.5 / 20.9 | 19.3 / 20.7 | 31.1 / 30.7 | 40.2 / 41.1
TwoStream-SLR (Chen et al. 2022b) | 18.4 / 18.8 | 17.7 / 19.3 | 25.4 / 25.3 | 40.8 / 40.5
CorrNet (Hu et al. 2023b) | 18.9 / 19.7 | 18.9 / 20.5 | 30.6 / 30.1 | 33.5 / 33.8
Ours-SLR* w/o Context | 18.8 / 19.2 | 17.8 / 19.0 | 22.7 / 23.1 | 30.2 / 30.7
Ours-SLR* | - | - | - | 28.0 / 27.4

Table 2: Quantitative evaluation of the Sign Language Recognition (SLR) task. WER is adopted as the evaluation metric. We train the other methods and ours on our SCOPE dataset; our model without context input is additionally evaluated on other popular datasets. The red and blue entries indicate the best and the second-best results.
Sign Language Translation (SLT). On the SLT task, we compare our approach with state-of-the-art gloss-supervised and gloss-free methods. Similarly, we stick to their respective training configurations when training their models on the SCOPE dataset. Results in Tab. 3 show that our approach outperforms previous methods by +3.73/+3.57 BLEU and +3.50/+3.14 BLEU on the Phoenix-2014T and CSL-Daily dev/test sets. Additionally, our full approach on the SCOPE dataset brings another +3.26/+2.79 BLEU improvement, which we attribute mainly to context-aware LLM fine-tuning. Notably, when comparing across datasets, the SCOPE dataset generally yields better performance for any fixed method. We primarily attribute this result to our robust data annotation and cleaning process.

Dataset | Method | Dev (R / B1 / B2 / B3 / B4) | Test (R / B1 / B2 / B3 / B4)
P-2014T | MMTLB-S2T (Chen et al. 2022a) | 53.10 / 53.95 / 41.12 / 33.14 / 27.61 | 52.65 / 53.97 / 41.75 / 33.84 / 28.39
P-2014T | TwoStream-S2T (Chen et al. 2022b) | 54.08 / 54.32 / 41.99 / 34.15 / 28.66 | 53.48 / 54.90 / 42.43 / 34.46 / 28.95
P-2014T | CV-SLT (Zhao et al. 2024) | 54.43 / 55.09 / 42.60 / 34.63 / 29.10 | 54.33 / 54.88 / 42.68 / 34.79 / 29.27
P-2014T | Ours* w/o Context | 67.09 / 61.80 / 49.09 / 39.53 / 32.83 | 60.06 / 61.74 / 49.22 / 39.61 / 32.84
CSL-Daily | MMTLB-S2T (Chen et al. 2022a) | 53.38 / 53.81 / 40.84 / 31.29 / 24.42 | 53.25 / 53.31 / 40.41 / 30.87 / 23.92
CSL-Daily | TwoStream-S2T (Chen et al. 2022b) | 55.10 / 55.21 / 42.31 / 32.71 / 25.76 | 55.72 / 55.44 / 42.59 / 32.87 / 25.59
CSL-Daily | CV-SLT (Zhao et al. 2024) | 56.36 / 58.05 / 44.73 / 35.14 / 28.24 | 57.06 / 58.29 / 45.15 / 35.77 / 28.94
CSL-Daily | Ours* w/o Context | 60.18 / 60.37 / 47.21 / 37.36 / 31.74 | 60.68 / 60.48 / 49.61 / 40.01 / 32.08
SCOPE | MMTLB-S2T (Chen et al. 2022a) | 63.25 / 60.72 / 50.33 / 40.39 / 31.61 | 64.30 / 61.69 / 51.75 / 41.98 / 33.56
SCOPE | TwoStream-S2T (Chen et al. 2022b) | 63.40 / 60.87 / 50.48 / 40.74 / 31.65 | 64.30 / 61.78 / 51.86 / 42.17 / 33.50
SCOPE | CV-SLT (Zhao et al. 2024) | 65.71 / 63.16 / 52.00 / 43.69 / 37.10 | 66.06 / 62.69 / 52.12 / 44.14 / 37.82
SCOPE | Ours* w/o Context | 69.34 / 64.31 / 53.15 / 43.57 / 34.83 | 69.46 / 64.62 / 53.64 / 44.13 / 35.80
SCOPE | Ours* | 69.78 / 65.68 / 55.06 / 46.18 / 38.09 | 70.14 / 65.85 / 55.42 / 46.56 / 38.59

Table 3: Quantitative evaluation of the Sign Language Translation (SLT) task (R: ROUGE, B: BLEU). We train the other methods on our dataset and our method on all three datasets; for non-context-based data, we train our method without context. The red and blue entries indicate the best and the second-best results.
Ablation Studies
We conduct ablation experiments for both the SLR and SLT tasks to validate the contribution of each component. The comparison between our full and context-free SCOPE models also serves as an ablation study demonstrating the significance of context information, both in recognition and in LLM fine-tuning. When the embedding alignment encoder is removed, the context sentence embeddings are concatenated to the motion features directly, and L_emb no longer serves as a supervision term. The performance of this model declines by 4.4% WER and 16.01 BLEU, and we note that it takes significantly longer for this model to converge. We thus deduce that the model has difficulty aligning motion features with LLM context embeddings and ultimately performs poorly. Removing L_MWER directly causes more word errors, deteriorating the translation results. Without our Iris Normalization process, the distribution of raw keypoints is severely biased, making the model overfit to extreme cases and unsuitable for real-time practical use with different aspect ratios and camera resolutions.

Ablation Study | SLR WER↓ | SLT R↑ | SLT B4↑
Full SCOPE* | 27.4 | 70.14 | 38.59
w/o Context | 30.7 | 69.46 | 35.80
w/o Embedding Encoder | 31.8 | 51.77 | 22.58
w/o L_MWER | 37.6 | 48.64 | 15.55
w/o Iris Normalization | 35.8 | 51.51 | 21.76

Table 4: Ablation studies of our contextual design and data processing algorithm.

Real-time Application and User Studies
Authentic feedback from the Deaf community is the gold standard for practical use. We have developed a real-time SLT application to assist Deaf individuals in accessing dental care; details are provided in the supplementary materials. We conducted a survey on the user experience, with questions concerning randomly selected SLR and SLT results. We collected 40 responses, rating the application's user experience at 4.15/5 on average, the accuracy of SLR results at 3.96/5, and SLT results at 3.98/5. These ratings indicate a positive response from the Deaf community, providing strong evidence of the effectiveness of our research.

Conclusion
We present the SCOPE dataset, the first dialogue-based Chinese Sign Language dataset featuring both gloss and text annotations. This dataset encompasses 72.4 hours of sign language videos collected from professional Deaf groups, complemented by 59,231 text annotations. Building on this dataset, we introduce the SCOPE framework, a robust pipeline specifically designed to address Sign Language Recognition (SLR) and Sign Language Translation (SLT) tasks with rich contextual information. Our comprehensive evaluations demonstrate the effectiveness of our methods and the significant improvements enabled by our dataset for the sign language community. We believe that SCOPE will catalyze future research in context-based sign language processing.
Cui, R.; Liu, H.; and Zhang, C. 2019. A deep neural framework for continuous sign language recognition by iterative training. IEEE Transactions on Multimedia, 21(7): 1880–1891.
Desai, A.; Berger, L.; Minakov, F.; Milano, N.; Singh, C.;
Pumphrey, K.; Ladner, R.; Daumé III, H.; Lu, A. X.; Caselli,
N.; et al. 2024. ASL citizen: a community-sourced dataset
References for advancing isolated sign language recognition. Advances
2024. Qwen2 Technical Report. in Neural Information Processing Systems, 36.
Ahn, J.; Jang, Y.; and Chung, J. S. 2024. Slowfast Network Dettmers, T.; Pagnoni, A.; Holtzman, A.; and Zettlemoyer,
for Continuous Sign Language Recognition. In IEEE Inter- L. 2023. QLoRA: efficient finetuning of quantized LLMs
national Conference on Acoustics, Speech and Signal Pro- (2023). arXiv preprint arXiv:2305.14314, 52: 3982–3992.
cessing, 3920–3924.
Duarte, A.; Palaskar, S.; Ventura, L.; Ghadiyaram, D.; De-
Albanie, S.; Varol, G.; Momeni, L.; Afouras, T.; Chung, Haan, K.; Metze, F.; Torres, J.; and Giro-i Nieto, X. 2021.
J. S.; Fox, N.; and Zisserman, A. 2020. BSL-1K: Scaling How2sign: a large-scale multimodal dataset for continuous
up co-articulated sign language recognition using mouthing american sign language. In Proceedings of the IEEE/CVF
cues. In Computer Vision–ECCV 2020: 16th European Con- conference on computer vision and pattern recognition,
ference, Glasgow, UK, August 23–28, 2020, Proceedings, 2735–2744.
Part XI 16, 35–53. Springer.
Gao, W.; Fang, G.; Zhao, D.; and Chen, Y. 2004. A Chi-
Bragg, D.; Koller, O.; Bellard, M.; Berke, L.; Boudreault, P.; nese sign language recognition system based on SOFM/S-
Braffort, A.; Caselli, N.; Huenerfauth, M.; Kacorri, H.; Ver- RN/HMM. Pattern Recognition, 37(12): 2389–2402.
hoef, T.; et al. 2019. Sign language recognition, generation,
and translation: An interdisciplinary perspective. In Pro- Gong, J.; Foo, L. G.; He, Y.; Rahmani, H.; and Liu, J. 2024.
ceedings of the 21st International ACM SIGACCESS Con- Llms are good sign language translators. In Proceedings of
ference on Computers and Accessibility, 16–31. the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 18362–18372.
Camgoz, N. C.; Hadfield, S.; Koller, O.; Ney, H.; and Bow-
den, R. 2018. Neural sign language translation. In Proceed- Graves, A.; Fernández, S.; Gomez, F.; and Schmidhuber, J.
ings of the IEEE conference on computer vision and pattern 2006. Connectionist temporal classification: labelling un-
recognition, 7784–7793. segmented sequence data with recurrent neural networks.
In Proceedings of the 23rd international conference on Ma-
Camgoz, N. C.; Koller, O.; Hadfield, S.; and Bowden, R. chine learning, 369–376.
2020. Sign language transformers: Joint end-to-end sign
language recognition and translation. In Proceedings of Hao, A.; Min, Y.; and Chen, X. 2021. Self-mutual distilla-
the IEEE/CVF conference on computer vision and pattern tion learning for continuous sign language recognition. In
recognition, 10023–10033. Proceedings of the IEEE/CVF international conference on
computer vision, 11303–11312.
Chen, H.; Wang, J.; Guo, Z.; Li, J.; Zhou, D.; Wu, B.;
Guan, C.; Chen, G.; and Heng, P.-A. 2024. SignVTCL: Hu, H.; Zhao, W.; Zhou, W.; Wang, Y.; and Li, H. 2021.
Multi-Modal Continuous Sign Language Recognition En- SignBERT: Pre-training of hand-model-aware representa-
hanced by Visual-Textual Contrastive Learning. CoRR, tion for sign language recognition. In Proceedings of the
abs/2401.11847. IEEE/CVF international conference on computer vision,
11087–11096.
Chen, Y.; Wei, F.; Sun, X.; Wu, Z.; and Lin, S. 2022a. A sim-
ple multi-modality transfer learning baseline for sign lan- Hu, L.; Gao, L.; Liu, Z.; and Feng, W. 2023a. Continuous
guage translation. In Proceedings of the IEEE/CVF Con- sign language recognition with correlation network. In Pro-
ference on Computer Vision and Pattern Recognition, 5120– ceedings of the IEEE/CVF Conference on Computer Vision
5130. and Pattern Recognition, 2529–2539.
Chen, Y.; Zuo, R.; Wei, F.; Wu, Y.; Liu, S.; and Mak, B. Hu, L.; Gao, L.; Liu, Z.; and Feng, W. 2023b. Continuous
2022b. Two-Stream Network for Sign Language Recogni- sign language recognition with correlation network. In Pro-
tion and Translation. NeurIPS. ceedings of the IEEE/CVF Conference on Computer Vision
Cheng, K. L.; Yang, Z.; Chen, Q.; and Tai, Y.-W. 2020. Fully and Pattern Recognition, 2529–2539.
convolutional networks for continuous sign language recog- Hu, L.; Gao, L.; Liu, Z.; and Feng, W. 2023c. Self-
nition. In Computer Vision–ECCV 2020: 16th European emphasizing network for continuous sign language recog-
Conference, Glasgow, UK, August 23–28, 2020, Proceed- nition. In Proceedings of the AAAI Conference on Artificial
ings, Part XXIV 16, 697–714. Springer. Intelligence, volume 37, 854–862.
Cihan Camgoz, N.; Hadfield, S.; Koller, O.; and Bowden, R. Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; and Li, W. 2018.
2017. Subunets: End-to-end hand shape and continuous sign Video-based sign language recognition without temporal
language recognition. In Proceedings of the IEEE interna- segmentation. In Proceedings of the AAAI Conference on
tional conference on computer vision, 3056–3065. Artificial Intelligence, volume 32.
Imashev, A.; Mukushev, M.; Kimmelman, V.; and Lin, C.-Y. 2004. Rouge: A package for automatic evaluation
Sandygulova, A. 2020. A Dataset for Linguistic Un- of summaries. In Text summarization branches out, 74–81.
derstanding, Visual Evaluation, and Recognition of Sign Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvinine-
Languages: The K-RSL. In Fernández, R.; and Linzen, T., jad, M.; Lewis, M.; and Zettlemoyer, L. 2020. Multilingual
eds., Proceedings of the 24th Conference on Computational denoising pre-training for neural machine translation. Trans-
Natural Language Learning, 631–640. Online: Association actions of the Association for Computational Linguistics, 8:
for Computational Linguistics. 726–742.
Jiao, P.; Min, Y.; Li, Y.; Wang, X.; Lei, L.; and Chen, X.
Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja,
2023. CoSign: Exploring co-occurrence signals in skeleton-
E.; Hays, M.; Zhang, F.; Chang, C.-L.; Yong, M. G.; Lee, J.;
based continuous sign language recognition. In Proceedings
et al. 2019. Mediapipe: A framework for building perception
of the IEEE/CVF International Conference on Computer Vi-
pipelines. arXiv preprint arXiv:1906.08172.
sion, 20676–20686.
Jin, S.; Xu, L.; Xu, J.; Wang, C.; Liu, W.; Qian, C.; Ouyang, Meng, Z.; Wu, Y.; Kanda, N.; Lu, L.; Chen, X.; Ye, G.; Sun,
W.; and Luo, P. 2020. Whole-Body Human Pose Estimation E.; Li, J.; and Gong, Y. 2021. Minimum word error rate
in the Wild. In Proceedings of the European Conference on training with language model fusion for end-to-end speech
Computer Vision (ECCV). recognition. arXiv preprint arXiv:2106.02302.
Joze, H. R. V.; and Koller, O. 2018. Ms-asl: A large-scale Min, Y.; Hao, A.; Chai, X.; and Chen, X. 2021. Visual align-
data set and benchmark for understanding american sign lan- ment constraint for continuous sign language recognition. In
guage. arXiv preprint arXiv:1812.01053. Proceedings of the IEEE/CVF international conference on
Ko, S.-K.; Kim, C. J.; Jung, H.; and Cho, C. 2019. Neural computer vision, 11542–11551.
sign language translation based on human keypoint estima- Neelakantan, A.; Xu, T.; Puri, R.; Radford, A.; Han, J. M.;
tion. Applied sciences, 9(13): 2683. Tworek, J.; Yuan, Q.; Tezak, N.; Kim, J. W.; Hallacy, C.;
Koller, O.; Camgoz, N. C.; Ney, H.; and Bowden, R. 2019. et al. 2022. Text and code embeddings by contrastive pre-
Weakly supervised learning with multi-stream CNN-LSTM- training. arXiv preprint arXiv:2201.10005.
HMMs to discover sequential parallelism in sign language Niu, Z.; and Mak, B. 2020. Stochastic fine-grained label-
videos. IEEE transactions on pattern analysis and machine ing of multi-state sign glosses for continuous sign language
intelligence, 42(9): 2306–2320. recognition. In Computer Vision–ECCV 2020: 16th Euro-
Koller, O.; Forster, J.; and Ney, H. 2015. Continuous sign pean Conference, Glasgow, UK, August 23–28, 2020, Pro-
language recognition: Towards large vocabulary statistical ceedings, Part XVI 16, 172–186. Springer.
recognition systems handling multiple signers. Computer Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.;
Vision and Image Understanding, 141: 108–125. Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.;
Koller, O.; Zargaran, S.; and Ney, H. 2017a. Re-Sign: Re- et al. 2022. Training language models to follow instructions
Aligned End-to-End Sequence Modelling with Deep Recur- with human feedback. Advances in neural information pro-
rent CNN-HMMs. In 2017 IEEE Conference on Computer cessing systems, 35: 27730–27744.
Vision and Pattern Recognition (CVPR), 3416–3424. Papadimitriou, K.; and Potamianos, G. 2020. Multimodal
Koller, O.; Zargaran, S.; and Ney, H. 2017b. Re-sign: Re- Sign Language Recognition via Temporal Deformable Con-
aligned end-to-end sequence modelling with deep recurrent volutional Sequence Learning. In Interspeech, 2752–2756.
CNN-HMMs. In Proceedings of the IEEE conference on Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002.
computer vision and pattern recognition, 4297–4305. Bleu: a method for automatic evaluation of machine trans-
Koller, O.; Zargaran, S.; Ney, H.; and Bowden, R. 2016. lation. In Proceedings of the 40th annual meeting of the
Deep Sign: Hybrid CNN-HMM for Continuous Sign Lan- Association for Computational Linguistics, 311–318.
guage Recognition. In BMVC, 136–1.
Pu, J.; Zhou, W.; Hu, H.; and Li, H. 2020. Boosting con-
Li, D.; Rodriguez, C.; Yu, X.; and Li, H. 2020a. Word-level tinuous sign language recognition via cross modality aug-
deep sign language recognition from video: A new large- mentation. In Proceedings of the 28th ACM international
scale dataset and methods comparison. In Proceedings of the conference on multimedia, 1497–1505.
IEEE/CVF winter conference on applications of computer
vision, 1459–1469. Pu, J.; Zhou, W.; and Li, H. 2019. Iterative alignment net-
work for continuous sign language recognition. In Proceed-
Li, D.; Xu, C.; Yu, X.; Zhang, K.; Swift, B.; Suominen,
ings of the IEEE/CVF conference on computer vision and
H.; and Li, H. 2020b. Tspnet: Hierarchical feature learn-
pattern recognition, 4165–4174.
ing via temporal semantic pyramid for sign language trans-
lation. Advances in Neural Information Processing Systems, Sincan, O. M.; and Keles, H. Y. 2020. AUTSL: A Large
33: 12034–12045. Scale Multi-Modal Turkish Sign Language Dataset and
Li, D.; Yu, X.; Xu, C.; Petersson, L.; and Li, H. 2020c. Baseline Methods. IEEE Access, 8: 181340–181355.
Transferring cross-domain knowledge for video sign lan- Sridhar, A.; Ganesan, R. G.; Kumar, P.; and Khapra, M.
guage recognition. In Proceedings of the IEEE/CVF Con- 2020. Include: A large scale dataset for indian sign language
ference on Computer Vision and Pattern Recognition, 6205– recognition. In Proceedings of the 28th ACM international
6214. conference on multimedia, 1366–1375.
Stokoe Jr, W. C. 2005. Sign language structure: An outline the AAAI Conference on Artificial Intelligence, volume 38,
of the visual communication systems of the American deaf. 19643–19651.
Journal of deaf studies and deaf education, 10(1): 3–37. Zheng, J.; Wang, Y.; Tan, C.; Li, S.; Wang, G.; Xia, J.;
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, Chen, Y.; and Li, S. Z. 2023. Cvt-slr: Contrastive visual-
M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; textual transformation for sign language recognition with
Azhar, F.; et al. 2023. Llama: Open and efficient founda- variational alignment. In Proceedings of the IEEE/CVF con-
tion language models. arXiv preprint arXiv:2302.13971. ference on computer vision and pattern recognition, 23141–
Tunga, A.; Nuthalapati, S. V.; and Wachs, J. 2021. Pose- 23150.
based sign language recognition using GCN and BERT. In Zhou, B.; Chen, Z.; Clapés, A.; Wan, J.; Liang, Y.; Escalera,
Proceedings of the IEEE/CVF winter conference on appli- S.; Lei, Z.; and Zhang, D. 2023. Gloss-free sign language
cations of computer vision, 31–40. translation: Improving from visual-language pretraining. In
von Agris, U.; Knorr, M.; and Kraiss, K.-F. 2008. The signif- Proceedings of the IEEE/CVF International Conference on
icance of facial features for automatic sign language recog- Computer Vision, 20871–20881.
nition. In 2008 8th IEEE International Conference on Auto- Zhou, H.; Zhou, W.; Qi, W.; Pu, J.; and Li, H. 2021a. Im-
matic Face & Gesture Recognition, 1–6. proving sign language translation with monolingual data by
Voskou, A.; Panousis, K. P.; Kosmopoulos, D.; Metaxas, sign back-translation. In Proceedings of the IEEE/CVF Con-
D. N.; and Chatzis, S. 2021. Stochastic transformer net- ference on Computer Vision and Pattern Recognition, 1316–
works with linear competing units: Application to end-to- 1325.
end sl translation. In Proceedings of the IEEE/CVF Interna- Zhou, H.; Zhou, W.; Zhou, Y.; and Li, H. 2021b. Spatial-
tional Conference on Computer Vision, 11946–11955. temporal multi-cue network for sign language recognition
Wang, H.; Chai, X.; Hong, X.; Zhao, G.; and Chen, X. 2016. and translation. IEEE Transactions on Multimedia, 24: 768–
Isolated sign language recognition with grassmann covari- 779.
ance matrices. ACM Transactions on Accessible Computing Zuo, R.; and Mak, B. 2022a. C2slr: Consistency-enhanced
(TACCESS), 8(4): 1–21. continuous sign language recognition. In Proceedings of
Wei, F.; and Chen, Y. 2023. Improving continuous sign lan- the IEEE/CVF Conference on Computer Vision and Pattern
guage recognition with cross-lingual signs. In Proceedings Recognition, 5131–5140.
of the IEEE/CVF International Conference on Computer Vi- Zuo, R.; and Mak, B. 2022b. Local Context-aware Self-
sion, 23612–23621. attention for Continuous Sign Language Recognition. Proc.
Wong, R. C.; Camgöz, N. C.; and Bowden, R. 2024. Interspeech 2022, 4810–4814.
SIGN2GPT: leveraging large language models for gloss-free
sign language translation. In ICLR 2024: The Twelfth Inter- Appendix
national Conference on Learning Representations.
SCOPE Dataset details
Yang, Z.; Zeng, A.; Yuan, C.; and Li, Y. 2023. Effective
whole-body pose estimation with two-stages distillation. In Data Collection Details The SCOPE dataset was meticu-
Proceedings of the IEEE/CVF International Conference on lously curated to ensure high-quality and diverse sign lan-
Computer Vision, 4210–4220. guage data. The data collection process involved several
Yin, A.; Zhong, T.; Tang, L.; Jin, W.; Jin, T.; and Zhao, Z. steps and considerations:
2023. Gloss attention for gloss-free sign language transla- • Participants: Our participants included both profes-
tion. In Proceedings of the IEEE/CVF conference on com- sional Deaf sign language teachers and non-professional
puter vision and pattern recognition, 2551–2562. Deaf individuals. The professional signers comprised
Yin, K.; and Read, J. 2020. Better Sign Language Trans- three sign language linguistics experts and several ex-
lation with STMC-Transformer. In Proceedings of the perienced Deaf signers. Non-professional signers were
28th International Conference on Computational Linguis- selected to represent a variety of ages, genders, occu-
tics, 5975–5989. pations, and educational backgrounds, capturing a wide
Zhang, J.; Zhou, W.; Xie, C.; Pu, J.; and Li, H. 2016. Chi- range of signing habits.
nese sign language recognition with adaptive HMM. In • Location: Data collection sessions were conducted in a
2016 IEEE international conference on multimedia and expo controlled environment designed to mimic real-life sce-
(ICME), 1–6. IEEE. narios. This setting was equipped with high-resolution
Zhang, W.; Huang, M.; Zhou, Y.; Zhang, J.; Yu, J.; Wang, J.; cameras and appropriate lighting to ensure clarity and ac-
and Xu, L. 2024. BOTH2Hands: Inferring 3D Hands from curacy in the captured sign language videos.
Both Text Prompts and Body Dynamics. In Proceedings of • Collection Process:
the IEEE/CVF Conference on Computer Vision and Pattern 1. Professional signers were provided with reference sen-
Recognition, 2393–2404. tences in natural language and asked to perform the
Zhao, R.; Zhang, L.; Fu, B.; Hu, C.; Su, J.; and Chen, Y. corresponding sign language. These sessions were
2024. Conditional variational autoencoder for sign language recorded, and the signers annotated the videos to cre-
translation with cross-modal alignment. In Proceedings of ate initial gloss annotations.
2. The annotated videos were reviewed by our sign Scene Statistics Clip Num Time
language linguistics experts to ensure accuracy and Medical treatment 1,3265 18.24h
consistency. Any discrepancies identified during this
phase were corrected.
Workplace 12,597 17.91h
Entertainment 10,391 15.10h
3. Non-professional signers were given the gloss annota-
tions and reference sentences to replicate the sign lan-
Family 10,775 10.68h
guage videos. This step ensured the inclusion of di- Education 10,740 8.46h
verse signing styles and habits. Shopping 1,457 2.01h
4. Each reference sentence was performed by four dif- Total 59,231 72.40h
ferent signers, resulting in multiple video samples per Table 5: Dataset Statistics. Scene Statistics classified sce-
sentence to enhance the richness of the dataset. nario appeared in the dataset, and calculate their total length.
• Data Cleaning: Our data cleaning process consists of
two main steps. Experiment Details
In this section, we provide an in-depth discussion of the ex-
1. Three sign language linguistics experts reviewed all
periment’s details, including input data quality, preparation
gloss annotations. Utilizing their expertise, they iden-
specifics, and additional evaluation comparisons.
tified equivalent gloss combinations, including one-
to-one, one-to-many, and many-to-one relationships. Input Details We employed the state-of-the-art Pose Esti-
One-to-one relationships were generally synonymous, mation method, DWPose (Yang et al. 2023), to accurately
and we consolidated these synonyms into a single identify keypoint positions for each action frame. Fig.5 com-
gloss. One-to-many and many-to-one relationships of- pares the 2D keypoints identified by MediaPipe (Lugaresi
ten pertained to phrases, and we determined the use of et al. 2019) from the POENIX-2014 dataset with those iden-
phrases based on the frequency of gloss occurrences. tified by DWPose. The keypoints from DWPose demonstrate
significantly higher quality than those from previous meth-
2. We developed a script to automate the identification ods.
and correction of inconsistencies in sign language an- Setup Details The experimental setup was meticulously de-
notations. This script analyzed discrepancies between signed to ensure the robustness and reproducibility of our
the test set and the ground truth, identifying common results. We begin by introducing the datasets used in the ex-
error types derived from the calculation of Word Er- periment:
ror Rate (WER), such as ”C-S-I-C”, ”C-I-S-C”, ”C-D-
S-C”, and ”C-S-D-C”. In these acronyms, each letter • PHOENIX14 is a German sign language dataset focused
represents a specific error type: Correct (C), Insertion on weather forecasts, containing 5,672 training samples,
(I), Substitution (S), and Deletion (D). Linguistics ex- 540 development samples, and 629 test samples.
perts then examined and corrected frequently confused • PHOENIX14-T extends PHOENIX14 with both gloss
patterns. Additionally, the script maintained a record and translation annotations, including 7,096 training
of previously processed annotation pairs to avoid re- samples, 519 development samples, and 642 test sam-
dundancy, thereby improving the efficiency and accu- ples.
racy of the data cleaning process. Through this itera- • CSL-Daily is a Chinese sign language dataset with gloss
tive process, we successfully reduced the vocabulary and translation annotations, comprising 18,401 training
size from 7,000 to 5,000, significantly enhancing the samples, 1,077 development samples, and 1,176 test
quality of the dataset. samples.
• SCOPE, as previously mentioned, is a Chinese sign lan-
• Database Management: All collected data, including guage dataset featuring gloss annotations, translation an-
raw videos, annotations, and metadata, were stored in a notations, and dialogue context information, with a total
structured database. This database facilitated easy access of 59,231 samples across training, development, and test
and management of the dataset for further processing and sets.
analysis.
Experiments were conducted on a cluster of 8 NVIDIA
A800 GPUs. Each GPU was assigned a subset of the dataset
Data Statistics The SCOPE dataset covers a wide range to facilitate parallel processing and efficient resource utiliza-
of scenarios and dialogue contexts, as detailed in Tab.5. It tion. All sign language videos were preprocessed to extract
includes professional settings such as medical treatment, 2D keypoints using DWPose, which were then normalized
work, and education, as well as daily situations like enter- using our Iris Normalization technique to account for varia-
tainment, family communication, and shopping. The dataset tions in video resolution and camera distance. Training de-
contains 33,154 clips under 5 seconds and 26,077 clips over tails are provided below:
5 seconds, with an average length of 5.05 seconds. The av-
erage sentence length is 11.84 characters, highlighting the • SCOPE: As mentioned in the previous section.
dataset’s richness. Fig.4 presents visual examples from the • SCOPE without context: Most settings are similar to the
SCOPE dataset, showcasing its diversity. full pipeline, but we use random noise instead of context
Figure 4: SCOPE gallery. We sampled different scenarios and show case the dataset sign videos and annotations.
MediaPipe
Inference Gloss
Dataset Gloss
Inference Sentence
DWPose
Rating Rating
Figure 5: Comparison of 2D keypoints identified by Me- Figure 6: User study on dataset and inference quality. In
diaPipe (up) and DWPose (down) from the Phoenix 2014 our user interface, participants rank the quality of the dataset
dataset. DWPose provides more accurate and detailed key- gloss (upper right) and the inference results (lower right) in
points. relation to the sign video (upper left and lower left). A higher
rank indicates better quality.
embedding for the embedding alignment encoder and
mask their attention. For the LLM, the prompt was mod- More Evaluation To demonstrate the robustness of our
ified to ”Summarizing the sentence using top 3 glosses pipeline, we conducted extensive evaluations comparing our
only.” SCOPE framework with several existing methods. The re-
• SEN-CSLR, CorrNet, and CV-SLT: We pretrained the sults are summarized in Tab.6.
models on our dataset following their detailed instruc-
tions. User Study We conducted our user study in a local hos-
• Two-Stream SLR: We first applied single-stream pre- pital, where Deaf individuals used our demo to communi-
training on our video and keypoint data modalities, fol- cate with the doctor. The doctor either typed responses or
lowed by two-stream joint training to predict gloss re- employed speech recognition to generate text for the Deaf
sults. patients to read, we also employed sign motion generation
• MMTLB and Two-Stream SLT: For MMTLB and the via the text. This demo is particularly useful in this context,
Two-Stream Network, we pretrained the sign-to-gloss as Deaf individuals often struggle to see the typing window
module and extracted visual features before performing while receiving dental treatment. An overview of the demo
joint training for the final translation process. is shown in Fig.7. After using the demo, we asked users to
rate the software, as well as the quality of our dataset and in-
Phoenix-2014 Phoenix-2014T CSL-Daily SCOPE
Method
Dev Test Dev Test Dev Test Dev Test
SubUNets (Cihan Camgoz et al. 2017) 40.8 40.7 - - 41.4 41.0 - -
LS-HAN (Huang et al. 2018) - - - - 39.0 39.4 - -
IAN (Pu, Zhou, and Li 2019) 37.1 36.7 - - - - - -
ReSign (Koller, Zargaran, and Ney 2017b) 27.1 26.8 - - - - - -
CNN-LSTM-HMMs (Multi-Stream) (Koller et al. 2019) 26.0 26.0 22.1 24.1 - - - -
SFL (Niu and Mak 2020) 24.9 25.3 25.1 26.1 - - - -
DNF (RGB) (Cui, Liu, and Zhang 2019) 23.8 24.4 - - 32.8 32.4 - -
FCN (Cheng et al. 2020) 23.7 23.9 23.3 25.1 33.2 33.5 - -
DNF (RGB+Flow) (Cui, Liu, and Zhang 2019) 23.1 22.9 - - - - - -
Joint-SLRT (Camgoz et al. 2020) - - 24.6 24.5 33.1 32.0 - -
VAC (Min et al. 2021) 21.2 22.3 - - - - - -
LCSA (Zuo and Mak 2022b) 21.4 21.9 - - - - - -
CMA (Pu et al. 2020) 21.3 21.9 - - - - - -
SignBT (Zhou et al. 2021a) - - 22.7 23.9 33.2 33.2 - -
MMTLB (Chen et al. 2022a) - - 21.9 22.5 - - - -
SMKD (Hao, Min, and Chen 2021) 20.8 21.0 20.8 22.4 - - - -
STMC-R (RGB+Pose) (Zhou et al. 2021b) 21.1 20.7 19.6 21.0 - - - -
C 2 SLR (RGB+Pose) (Zuo and Mak 2022a) 20.5 20.4 20.2 20.4 - - - -
SEN-CSLR (Hu et al. 2023c) 19.5 20.9 19.3 20.7 31.1 30.7 40.2 41.1
TwoStream-SLR (Chen et al. 2022b) 18.4 18.8 17.7 19.3 25.4 25.3 40.8 40.5
SlowFastSign (Ahn, Jang, and Chung 2024) 18.0 18.2 17.6 18.7 25.4 24.8 35.2 35.6
CorrNet (Hu et al. 2023b) 18.9 19.7 18.9 20.5 30.6 30.1 33.5 33.8
Ours-SLR∗ w/o Context 18.8 19.2 17.8 19.0 22.7 23.1 30.2 30.7
Ours-SLR∗ - - - - - - 28.0 27.4
Table 6: Quantitative evaluation of various Sign Language Recognition (SLR) tasks. Word Error Rate (WER) is used as the
evaluation metric. We trained both our method and other approaches on the SCOPE dataset. Additionally, our model without
context input was evaluated on other popular datasets. The red entries indicate the best results, while the blue entries represent
the second-best results.
ference results, as illustrated in Fig. 6. The results presented in the Experiment section demonstrate that our solution is well received by the Deaf community.

Figure 7: The demo usage scene in a hospital.

Limitation and Broader Impact
Limitation. While our SCOPE framework and dataset represent significant advancements in sign language recognition and translation, several limitations must be considered. First, although the SCOPE dataset covers a wide range of scenarios and includes diverse participants, there are still many professional contexts that need to be addressed, such as emergencies (fire, police), insurance consultancy, pet medical treatment, and transportation navigation for the Deaf. Future dataset collection should explore these scenarios. Second, while SCOPE can incorporate contextual information in sign language processing, excessively long contexts may have minimal impact on the current sentence. Investigating an attention decay mechanism should be a future research direction. Finally, regarding speed normalization, although input image sizes are scaled using our iris normalization, variations in signing speed among signers can complicate the recognition network's ability to process sign frames effectively. Future explorations should focus on addressing sign speed variations.

Broader Impact. The development of the SCOPE framework and dataset offers significant benefits to both the Deaf community and the research field. By enhancing the accuracy and context-awareness of sign language recognition, our work fosters better communication between Deaf and hearing individuals, promoting inclusiveness. Additionally, the SCOPE dataset serves as a valuable resource for developing educational tools that teach sign language and raise awareness in the broader community. Our real-time sign language translation application is particularly advantageous in healthcare settings, where effective communication is essential for improving care quality for Deaf patients. Furthermore, the insights gained from our research can contribute to advancements in areas such as gesture recognition, virtual reality, and augmented reality. By open-sourcing our dataset and code, we aim to stimulate further research in sign language processing, leading to innovations that benefit the community. In summary, our work has the potential to drive positive change and innovation, ultimately contributing to a more inclusive and accessible society.