Papers by Necati Cihan Camgoz

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021
Predicting 3D human pose from a single monoscopic video can be highly challenging due to factors such as low resolution, motion blur and occlusion, in addition to the fundamental ambiguity in estimating 3D from 2D. Approaches that directly regress the 3D pose from independent images can be particularly susceptible to these factors and result in jitter, noise and/or inconsistencies in skeletal estimation. Much of this can be overcome if the temporal evolution of the scene and skeleton are taken into account. However, rather than tracking body parts and trying to temporally smooth them, we propose a novel transformer-based network that can learn a distribution over both pose and motion in an unsupervised fashion. We call our approach Skeletor. Skeletor overcomes inaccuracies in detection and corrects partial or entire skeleton corruption. Skeletor uses strong priors learned from 25 million frames to correct skeleton sequences smoothly and consistently. Skeletor can achieve this as i...

Sign language is the natural medium of communication for the Deaf community. In this study, we have developed an interactive communication interface for hospitals, HospiSign, using computer vision based sign language recognition methods. The objective of this paper is to review sign language based Human-Computer Interaction applications and to introduce HospiSign in this context. HospiSign is designed to meet deaf people at the information desk of a hospital and to assist them in their visit. The interface guides deaf visitors to answer certain questions and express the intention of their visit, in sign language, without the need of a translator. The system consists of a computer, a touch display to visualize the interface, and a Microsoft Kinect v2 sensor to capture the users’ sign responses. HospiSign recognizes isolated signs in a structured activity diagram using Dynamic Time Warping based classifiers. In order to evaluate the developed interface, we performed usability tests and ...
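
As a rough illustration of the Dynamic Time Warping based classification mentioned above, the following Python sketch computes a DTW alignment cost between two variable-length feature sequences and assigns an isolated sign to its nearest template. The feature dimensions and the template dictionary are hypothetical stand-ins, not the HospiSign implementation.

```python
# Minimal DTW-based nearest-template classification for isolated signs.
# Feature sequences are (T, D) arrays; templates are illustrative only.
import numpy as np

def dtw_distance(query, template):
    """Align two variable-length feature sequences and return the cumulative cost."""
    t1, t2 = len(query), len(template)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(query[i - 1] - template[j - 1])  # frame-wise distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[t1, t2]

def classify(sign_features, templates):
    """templates: dict mapping sign label -> reference feature sequence (hypothetical)."""
    return min(templates, key=lambda label: dtw_distance(sign_features, templates[label]))
```
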
ArXiv, 2021
Causal reasoning is a crucial part of science and human intelligence. In order to discover causal relationships from data, we need structure discovery methods. We provide a review of background theory and a survey of methods for structure discovery. We primarily focus on modern, continuous optimization methods, and provide reference to further resources such as benchmark datasets and software packages. Finally, we discuss the assumptive leap required to take us from structure to causality.
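
One concrete example of the modern, continuous optimization methods surveyed is the NOTEARS-style formulation, which scores a weighted adjacency matrix with a differentiable acyclicity penalty h(W) = tr(exp(W ∘ W)) − d. The sketch below illustrates that penalty only; it is a general example, not code from the review.

```python
# Acyclicity penalty used by NOTEARS-style continuous structure discovery:
# h(W) >= 0, with h(W) == 0 exactly when W encodes a directed acyclic graph.
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the element-wise square

# Example: a 3-node chain (acyclic) versus a 2-cycle.
dag = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
cyc = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
print(acyclicity(dag))  # ~0.0
print(acyclicity(cyc))  # > 0, penalised during optimization
```
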

ArXiv, 2021
In this paper, we focus on the task of one-shot sign spotting, i.e. given an example of an isolated sign (query), we want to identify whether/where this sign appears in a continuous, co-articulated sign language video (target). To achieve this goal, we propose a transformer-based network, called Sign-Lookup. We employ 3D Convolutional Neural Networks (CNNs) to extract spatio-temporal representations from video clips. To solve the temporal scale discrepancies between the query and the target videos, we construct multiple queries from a single video clip using different frame-level strides. Self-attention is applied across these query clips to simulate a continuous scale space. We also utilize another self-attention module on the target video to learn the context within the sequence. Finally, a mutual-attention module is used to match the temporal scales and localize the query within the target sequence. Extensive experiments demonstrate that the proposed approach can not only reliably iden...
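
A minimal sketch of the multi-scale query construction described above: the same isolated-sign clip is sub-sampled with several frame strides before attention is applied across the resulting clips. The tensor sizes and stride set are assumptions for illustration only.

```python
# Build several temporally re-scaled versions of one query clip (hypothetical sizes).
import torch

def build_multi_scale_queries(query_clip, strides=(1, 2, 3)):
    """query_clip: (T, C, H, W) frames of an isolated sign; one clip per stride."""
    return [query_clip[::s] for s in strides]

query_clip = torch.randn(32, 3, 224, 224)        # 32-frame query (illustrative)
queries = build_multi_scale_queries(query_clip)   # lengths 32, 16, 11
print([q.shape[0] for q in queries])
```
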

Int. J. Comput. Vis., 2021
Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community. Previous deep learning-based SLP works have produced only a concatenation of isolated signs focusing primarily on the manual features, leading to a robotic and non-expressive production. In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network architecture introduces a counter decoding that enables variable-length continuous sequence generation by tracking the production progress over time and predicting the end of the sequence. We present extensive data augmentation techniques to reduce prediction ...
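
To make the counter-decoding idea concrete, here is a hedged sketch of an autoregressive loop in which the decoder emits a progress counter in [0, 1] alongside each pose frame and stops once the counter nears 1. The decoder interface, its pose_dim attribute and the stopping threshold are hypothetical stand-ins, not the published architecture.

```python
# Counter decoding sketch: generation length is controlled by a predicted counter,
# rather than a discrete end-of-sequence token.
import torch

@torch.no_grad()
def counter_decode(decoder, text_memory, max_frames=300, eos_threshold=0.95):
    poses, counters = [], []
    frame = torch.zeros(1, decoder.pose_dim)         # initial (empty) pose frame
    for _ in range(max_frames):
        pose, counter = decoder(frame, text_memory)  # counter in [0, 1] tracks progress
        poses.append(pose)
        counters.append(counter)
        if counter.item() >= eos_threshold:          # counter signals end of sequence
            break
        frame = pose                                  # autoregressive feedback
    return torch.stack(poses), torch.stack(counters)
```
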

Although there have been some recent advances in sign language recognition, part of the problem is that most computer scientists in this research area do not have the required in-depth knowledge of sign language, and often have no connection with the Deaf community or sign linguists. For example, one project described as translation into sign language aimed to take subtitles and turn them into fingerspelling. This is one of many reasons why much of this technology, including sign-language gloves, simply doesn’t address the challenges. However, there are benefits to achieving automatic sign language recognition. The process of annotating and analysing sign language data on video is extremely labour-intensive. Sign language recognition technology could help speed this up. Until recently we have lacked large signed video datasets that have been precisely and consistently transcribed and translated – these are needed to train computers for automation. But sign language corpora – large da...

ArXiv, 2021
The visual anonymisation of sign language data is an essential task to address privacy concerns raised by large-scale dataset collection. Previous anonymisation techniques have either significantly affected sign comprehension or required manual, labour-intensive work. In this paper, we formally introduce the task of Sign Language Video Anonymisation (SLVA) as an automatic method to anonymise the visual appearance of a sign language video whilst retaining the meaning of the original sign language sequence. To tackle SLVA, we propose ANONYSIGN, a novel automatic approach for visual anonymisation of sign language data. We first extract pose information from the source video to remove the original signer appearance. We next generate a photo-realistic sign language video of a novel appearance from the pose sequence, using image-to-image translation methods in a conditional variational autoencoder framework. An approximate posterior style distribution is learnt, which can be sampled from t...
Computational sign language research lacks the large-scale datasets that enable the creation of useful real-life applications. To date, most research has been limited to prototype systems on small domains of discourse, e.g. weather forecasts. To address this issue and to push the field forward, we release six datasets comprised of 190 hours of footage on the larger domain of news. From this, 20 hours of footage have been annotated by Deaf experts and interpreters and are made publicly available for research purposes. In this paper, we share the dataset collection process and tools developed to enable the alignment of sign language video and subtitles, as well as baseline translation results to underpin future research.
Sign languages are visual languages produced by the movement of the hands, face, and body. In this paper, we evaluate representations based on skeleton poses, as these are explainable, person-independent, privacy-preserving, low-dimensional representations. Skeletal representations generalize over an individual’s appearance and background, allowing us to focus on the recognition of motion. But how much information is lost by the skeletal representation? We perform two independent studies using two state-of-the-art pose estimation systems. We analyze the applicability of the pose estimation systems to sign language recognition by evaluating the failure cases of the recognition models. Importantly, this allows us to characterize the current limitations of skeletal pose estimation approaches in sign language recognition.

To be truly understandable and accepted by Deaf communities, an automatic Sign Language Production (SLP) system must generate a photo-realistic signer. Prior approaches based on graphical avatars have proven unpopular, whereas recent neural SLP works that produce skeleton pose sequences have been shown not to be understandable to Deaf viewers. In this paper, we propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language. We employ a transformer architecture with a Mixture Density Network (MDN) formulation to handle the translation from spoken language to skeletal pose. A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence. This allows the photo-realistic production of sign videos directly translated from written text. We further propose a novel keypoint-based loss function, which significantly improves the quality of synthesized ha...
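
The Mixture Density Network formulation mentioned above can be pictured as an output head that predicts mixture weights, means and scales over the skeletal pose vector at each decoding step. The sketch below is an assumption-laden illustration (layer sizes, number of mixtures and pose dimensionality are invented), not the SignGAN implementation.

```python
# MDN output head: parameters of a Gaussian mixture over the next pose frame.
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    def __init__(self, hidden_dim=512, pose_dim=150, num_mixtures=5):
        super().__init__()
        self.num_mixtures, self.pose_dim = num_mixtures, pose_dim
        self.pi = nn.Linear(hidden_dim, num_mixtures)                    # mixture weights
        self.mu = nn.Linear(hidden_dim, num_mixtures * pose_dim)         # component means
        self.log_sigma = nn.Linear(hidden_dim, num_mixtures * pose_dim)  # component scales

    def forward(self, h):
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(*h.shape[:-1], self.num_mixtures, self.pose_dim)
        sigma = self.log_sigma(h).view_as(mu).exp()
        return pi, mu, sigma
```
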

Variational AutoEncoders (VAEs) provide a means to generate representational latent embeddings. Previous research has highlighted the benefits of achieving representations that are disentangled, particularly for downstream tasks. However, there is some debate about how to encourage disentanglement with VAEs, and evidence indicates that existing implementations do not achieve disentanglement consistently. How well a VAE’s latent space has been disentangled is often evaluated against our subjective expectations of which attributes should be disentangled for a given problem. Therefore, by definition, we already have domain knowledge of what should be achieved, and yet we use unsupervised approaches to achieve it. We propose a weakly-supervised approach that incorporates any available domain knowledge into the training process to form a Gated-VAE. The process involves partitioning the representational embedding and gating backpropagation. All partitions are utilised on ...
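
The partition-and-gate idea can be illustrated with a few lines of PyTorch: the latent embedding is split into partitions, and gradients are blocked (via detach) for all but the partition associated with the current supervision signal. Partition sizes and the surrounding training loop are assumptions, not the paper's exact procedure.

```python
# Gate backpropagation so only one latent partition receives gradients.
import torch

def gate_partitions(z, partition_sizes, active_idx):
    """z: (batch, latent_dim). Keep gradients for partition `active_idx`,
    block them for every other partition."""
    parts = torch.split(z, partition_sizes, dim=-1)
    gated = [p if i == active_idx else p.detach() for i, p in enumerate(parts)]
    return torch.cat(gated, dim=-1)

# Example: a 12-d latent split into three 4-d partitions; only partition 1
# receives gradients from whatever loss is computed on the gated embedding.
z = torch.randn(8, 12, requires_grad=True)
z_gated = gate_partitions(z, [4, 4, 4], active_idx=1)
```
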

It is common practice to represent spoken languages at their phonetic level. However, for sign languages, this implies breaking motion into its constituent motion primitives. Avatar-based Sign Language Production (SLP) has traditionally done just this, building up animation from sequences of hand motions, shapes and facial expressions. However, more recent deep learning-based solutions to SLP have tackled the problem using a single network that estimates the full skeletal structure. We propose splitting the SLP task into two distinct jointly-trained sub-tasks. The first translation sub-task translates from spoken language to a latent sign language representation, with gloss supervision. Subsequently, the animation sub-task aims to produce expressive sign language sequences that closely resemble the learnt spatiotemporal representation. Using a progressive transformer for the translation sub-task, we propose a novel Mixture of Motion Primitives (MOMP) architecture for sign language a...

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Disentangled representations support a range of downstream tasks including causal reasoning, generative modeling, and fair machine learning. Unfortunately, disentanglement has been shown to be impossible without the incorporation of supervision or inductive bias. Given that supervision is often expensive or infeasible to acquire, we choose to incorporate structural inductive bias and present an unsupervised, deep State-Space-Model for Video Disentanglement (VDSM). The model disentangles latent time-varying and dynamic factors via the incorporation of hierarchical structure with a dynamic prior and a Mixture of Experts decoder. VDSM learns separate disentangled representations for the identity of the object or person in the video, and for the action being performed. We evaluate VDSM across a range of qualitative and quantitative tasks including identity and dynamics transfer, sequence generation, Fréchet Inception Distance, and factor classification. VDSM achieves state-of-the-art performance and exceeds adversarial methods, even when the methods use additional supervision.

2021 IEEE International Conference on Smart Data Services (SMDS)
Undertaking causal inference with observational data is incredibly useful across a wide range of tasks including the development of medical treatments, advertisements and marketing, and policy making. There are two significant challenges associated with undertaking causal inference using observational data: treatment assignment heterogeneity (i.e., differences between the treated and untreated groups), and an absence of counterfactual data (i.e., not knowing what would have happened if an individual who did get treatment were instead to have not been treated). We address these two challenges by combining structured inference and targeted learning. In terms of structure, we factorize the joint distribution into risk, confounding, instrumental, and miscellaneous factors, and in terms of targeted learning, we apply a regularizer derived from the influence curve in order to reduce residual bias. An ablation study is undertaken, and an evaluation on benchmark datasets demonstrates that TVAE has competitive and state-of-the-art performance.

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021
An important goal across most scientific fields is the discovery of causal structures underlying a set of observations. Unfortunately, causal discovery methods which are based on correlation or mutual information can often fail to identify causal links in systems which exhibit dynamic relationships. Such dynamic systems (including the famous coupled logistic map) exhibit ‘mirage’ correlations which appear and disappear depending on the observation window. This means not only that correlation is not causation but, perhaps counter-intuitively, that causation may occur without correlation. In this paper we describe Neural Shadow-Mapping, a neural network based method which embeds high-dimensional video data into a low-dimensional shadow representation, for subsequent estimation of causal links. We demonstrate its performance at discovering causal links from video-representations of dynamic systems.
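
For intuition on the ‘mirage’ correlations mentioned above, the coupled logistic map can be simulated directly: the two variables are causally coupled, yet their sample correlation changes markedly from one observation window to the next. The parameter values below are commonly used illustrative choices, not taken from the paper.

```python
# Coupled logistic map:
#   x_{t+1} = x_t (r_x - r_x x_t - b_xy y_t)
#   y_{t+1} = y_t (r_y - r_y y_t - b_yx x_t)
# Correlation between x and y drifts with the observation window ("mirage" correlation).
import numpy as np

def coupled_logistic(n=2000, rx=3.8, ry=3.5, bxy=0.02, byx=0.1, x0=0.4, y0=0.2):
    x, y = np.empty(n), np.empty(n)
    x[0], y[0] = x0, y0
    for t in range(n - 1):
        x[t + 1] = x[t] * (rx - rx * x[t] - bxy * y[t])
        y[t + 1] = y[t] * (ry - ry * y[t] - byx * x[t])
    return x, y

x, y = coupled_logistic()
for start in (0, 500, 1000, 1500):
    window = slice(start, start + 200)
    print(start, np.corrcoef(x[window], y[window])[0, 1])  # varies by window
```
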
Companion Publication of the 2020 International Conference on Multimodal Interaction

International Journal of Computer Vision
We present a novel approach to automatic Sign Language Production using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks, and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph. The resulting pose information is then used to condition a generative model that produces photo-realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to...
16th European Conference on Computer Vision (ECCV), ACVR Workshop, 2020
Sign languages use multiple asynchronous information channels (articulators), not just the hands but also the face and body, which computational approaches often ignore. In this paper we tackle the multi-articulatory sign language translation task and propose a novel multi-channel transformer architecture. The proposed architecture allows both the inter and intra contextual relationships between different sign articulators to be modelled within the transformer network itself, while also maintaining channel-specific information. We evaluate our approach on the RWTH-PHOENIX-Weather-2014T dataset and report competitive translation performance. Importantly, we overcome the reliance on gloss annotations which underpin other state-of-the-art approaches, thereby removing the need for expensive curated datasets.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss-level tokenization in order to work. We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains.
We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 Score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
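
As an illustration of the joint objective described above, the sketch below combines a CTC recognition loss over gloss logits with a cross-entropy translation loss over spoken-language tokens. The tensor shapes, padding conventions and loss weighting are assumptions rather than the published configuration.

```python
# Joint recognition (CTC) + translation (cross-entropy) objective sketch.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
xent_loss = nn.CrossEntropyLoss(ignore_index=0)  # 0 = padding token (assumed)

def joint_loss(gloss_logits, gloss_targets, input_lengths, target_lengths,
               word_logits, word_targets, recognition_weight=1.0, translation_weight=1.0):
    """gloss_logits: (T, batch, gloss_vocab) encoder outputs for CTC recognition.
    word_logits:  (batch, seq, word_vocab) decoder outputs for translation."""
    recognition = ctc_loss(gloss_logits.log_softmax(-1), gloss_targets,
                           input_lengths, target_lengths)
    translation = xent_loss(word_logits.reshape(-1, word_logits.size(-1)),
                            word_targets.reshape(-1))
    return recognition_weight * recognition + translation_weight * translation
```
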