Synchronous bidirectional inference for neural sequence generation
Zhang, Jiajun, et al.
ARTIFICIAL INTELLIGENCE, 2020, 281
In sequence-to-sequence generation tasks (e.g. machine translation and abstractive summarization), inference is generally performed in a left-to-right manner to produce the result token by token. Neural approaches, such as LSTM and self-attention networks, are now able to make full use of all the predicted history hypotheses from the left side during inference, but cannot meanwhile access any future (right-side) information, and so usually generate unbalanced outputs (e.g. left parts are much more accurate than right ones in Chinese-English translation). In this work, we propose a synchronous bidirectional inference model to generate outputs using both left-to-right and right-to-left decoding simultaneously and interactively. First, we introduce a novel beam search algorithm that facilitates synchronous bidirectional decoding. Then, we present the core approach which enables left-to-right and right-to-left decoding to interact with each other, so as to utilize both the history and future predictions simultaneously during inference. We apply the proposed model to both LSTM and self-attention networks. Furthermore, we propose a novel fine-tuning based parameter optimization algorithm in addition to the simple two-pass strategy. Extensive experiments on machine translation and abstractive summarization demonstrate that our synchronous bidirectional inference model achieves remarkable improvements over strong baselines.
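The core idea of the abstract above can be sketched as a toy beam search in which a left-to-right and a right-to-left beam advance in lockstep and each direction's scores are adjusted by what the other direction has already produced. Everything here is an illustrative stand-in (the fixed log-probabilities, the 0.5 interaction bonus, the function names), not the paper's actual model:

```python
# Made-up token log-probabilities standing in for a real decoder's softmax.
LOGP = {"a": -0.5, "b": -1.0, "c": -2.0}

def score(token, other_side_tokens):
    # Toy interaction: a bonus when the opposite-direction beam already
    # produced this token. A real model would condition on the other
    # direction's hidden states instead.
    bonus = 0.5 if token in other_side_tokens else 0.0
    return LOGP[token] + bonus

def sync_bidir_beam_search(steps, beam_size=2):
    beams = {"l2r": [((), 0.0)], "r2l": [((), 0.0)]}
    for _ in range(steps):
        new_beams = {}
        for side, other in (("l2r", "r2l"), ("r2l", "l2r")):
            seen = {t for hyp, _ in beams[other] for t in hyp}
            cands = [(hyp + (tok,), s + score(tok, seen))
                     for hyp, s in beams[side] for tok in LOGP]
            cands.sort(key=lambda c: c[1], reverse=True)
            new_beams[side] = cands[:beam_size]
        beams = new_beams  # both directions advance synchronously
    side = max(beams, key=lambda s: beams[s][0][1])
    hyp, s = beams[side][0]
    return (hyp if side == "l2r" else hyp[::-1]), s
```

The key structural point is that both beams are expanded inside the same loop iteration, so each direction sees the other's current partial hypotheses rather than a finished reverse pass.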
Exploiting reverse target-side contexts for neural machine translation via asynchronous bidirectional decoding
Su, Jinsong, et al.
ARTIFICIAL INTELLIGENCE, 2019, 277
Based on a unified encoder-decoder framework with an attention mechanism, neural machine translation (NMT) models have attracted much attention and become the mainstream in the machine translation community. Generally, NMT decoders produce the translation in a left-to-right way. As a result, only left-to-right target-side contexts from the generated translations are exploited, while the right-to-left target-side contexts are left completely unexploited. In this paper, we extend the conventional attentional encoder-decoder NMT framework by introducing a backward decoder, in order to explore asynchronous bidirectional decoding for NMT. In the first step after encoding, our backward decoder learns to generate the target-side hidden states in a right-to-left manner. Next, at each timestep of translation prediction, our forward decoder concurrently considers both the source-side and the reverse target-side hidden states via two attention models. Compared with previous models, this architecture enables our model to fully exploit contexts from both the source side and the target side, which together improve translation quality. We conducted experiments on NIST Chinese-English, WMT English-German and Finnish-English translation tasks to investigate the effectiveness of our model. Experimental results show that (1) our improved RNN-based NMT model achieves significant improvements over the conventional RNNSearch by 1.44/-3.02, 1.11/-1.01, and 1.23/-1.27 average BLEU and TER points, respectively; and (2) our enhanced Transformer outperforms the standard Transformer by 1.56/-1.49, 1.76/-2.49, and 1.29/-1.33 average BLEU and TER points, respectively.
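The "two attention models" step in the abstract above can be sketched as a forward-decoder step that attends separately over encoder states and over the backward decoder's right-to-left hidden states. This is a minimal dot-product-attention sketch under assumed toy dimensions, not the paper's exact architecture (which uses learned attention parameters and a real output layer):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attend(query, memory):
    # Plain dot-product attention: weight each memory vector by its
    # similarity to the query, then return the weighted sum.
    weights = softmax([sum(q * k for q, k in zip(query, vec)) for vec in memory])
    dim = len(memory[0])
    return [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]

def forward_decoder_step(state, src_states, bwd_states):
    # Two attention models, as in asynchronous bidirectional decoding:
    # one over the encoder states (source side) and one over the backward
    # decoder's right-to-left hidden states (reverse target side).
    src_ctx = attend(state, src_states)
    rev_ctx = attend(state, bwd_states)
    # Concatenation stands in for the real layer that combines both contexts
    # to predict the next token.
    return src_ctx + rev_ctx
```

Note the asymmetry with the synchronous model above: here the backward decoder runs to completion first, and the forward decoder then treats its hidden states as a fixed second memory.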
Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks
Stoll, Stephanie, et al.
INTERNATIONAL JOURNAL OF COMPUTER VISION, 2020, 128(4) SI: 891-908
We present a novel approach to automatic Sign Language Production using recent developments in Neural Machine Translation (NMT), Generative Adversarial Networks, and motion generation. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign pose sequences by combining an NMT network with a Motion Graph. The resulting pose information is then used to condition a generative model that produces photo realistic sign language video sequences. This is the first approach to continuous sign video generation that does not use a classical graphical avatar. We evaluate the translation abilities of our approach on the PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach for both multi-signer and high-definition settings qualitatively and quantitatively using broadcast quality assessment metrics.
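One of the dedicated sub-processes described above, stitching per-gloss pose clips into a continuous sequence, can be illustrated with a toy blending step. The gloss names, single-joint "poses", and blend rule below are all made-up stand-ins; the paper's actual Motion Graph is a learned/graph-based component, not this heuristic:

```python
# Toy per-gloss pose clips: each clip is a list of one-dimensional "poses".
CLIPS = {
    "HELLO": [[0.0], [0.2], [0.4]],
    "WORLD": [[0.5], [0.7], [0.9]],
}

def stitch(gloss_sequence, blend=0.5):
    # Concatenate the clip for each gloss, inserting a blended seam frame
    # between consecutive clips so transitions between signs stay smooth.
    poses = []
    for gloss in gloss_sequence:
        clip = CLIPS[gloss]
        if poses:
            seam = [(1 - blend) * a + blend * b
                    for a, b in zip(poses[-1], clip[0])]
            poses.append(seam)
        poses.extend(clip)
    return poses
```

In the full pipeline, a sequence like this would then condition the generative model that renders photo-realistic video frames.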
Neural Machine Translation with Deep Attention
Zhang, Biao, et al.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42(1): 154-163
Deepening neural models has proven very successful in improving model capacity on complex learning tasks, such as machine translation. Previous efforts on deep neural machine translation mainly focus on the encoder and the decoder, with little work on the attention mechanism. However, the attention mechanism is of vital importance for inducing the translation correspondence between different languages, where shallow networks are relatively insufficient, especially when the encoder and decoder are deep. In this paper, we propose a deep attention model (DeepAtt). Based on the low-level attention information, DeepAtt is capable of automatically determining what should be passed or suppressed from the corresponding encoder layer, so as to make the distributed representation appropriate for high-level attention and translation. We conduct experiments on NIST Chinese-English, WMT English-German, and WMT English-French translation tasks, where, with five attention layers, DeepAtt yields very competitive performance against the state-of-the-art results. We empirically find that with an adequate increase of attention layers, DeepAtt tends to produce more accurate attention weights. An in-depth analysis of the translation of important context words further reveals that DeepAtt significantly improves the faithfulness of system translations.
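The stacked-attention-with-gating idea in the abstract above can be sketched as follows. Each attention sub-layer reads one encoder layer, enriches its query with the previous sub-layer's context (the "low-level attention information"), and applies a sigmoid gate to decide what is passed or suppressed. The gating form and dimensions are illustrative assumptions, not DeepAtt's exact parameterization:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attention_layer(query, enc_states, prev_ctx):
    # Enrich the query with the previous layer's context, attend over this
    # encoder layer's states, then gate the resulting context vector.
    q = [a + b for a, b in zip(query, prev_ctx)]
    weights = softmax([sum(x * y for x, y in zip(q, h)) for h in enc_states])
    dim = len(enc_states[0])
    ctx = [sum(w * h[i] for w, h in zip(weights, enc_states)) for i in range(dim)]
    gate = [1 / (1 + math.exp(-c)) for c in ctx]  # what to pass or suppress
    return [g * c for g, c in zip(gate, ctx)]

def deep_attention(query, encoder_layers):
    ctx = [0.0] * len(query)
    for enc_states in encoder_layers:  # one attention sub-layer per encoder layer
        ctx = attention_layer(query, enc_states, ctx)
    return ctx
```

Because each sub-layer's output feeds the next sub-layer's query, adding attention layers lets the model progressively refine where it looks, which is consistent with the paper's observation that deeper attention produces more accurate attention weights.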