Tech At Bloomberg

Bloomberg’s AI Engineering Group & CTO Office Publish 11 NLP Research Papers at ACL 2023

July 09, 2023

During the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) in Toronto, Canada this week, researchers from Bloomberg’s AI Engineering Group and CTO Office are showcasing their expertise in natural language processing (NLP) by publishing 11 papers, five of which appear in Findings of the Association for Computational Linguistics: ACL 2023.

Through these papers, the authors and their collaborators highlight a variety of NLP applications, novel approaches and improved models used in key tasks, and other advances to the state-of-the-art in the field of computational linguistics.

We asked some of the authors to summarize their research and explain why the results were notable:


Evaluating Paraphrastic Robustness in Textual Entailment Models
Dhruv Verma (Stony Brook University), Yash Kumar Lal (Stony Brook University), Shreyashee Sinha (Bloomberg), Benjamin Van Durme (Johns Hopkins University), Adam Poliak (Bryn Mawr College)

Poster Session 1: Semantics: Sentence-level Semantics, Textual Inference, and Other Areas | Monday, July 10, 11:00 AM-12:30 PM EDT

Front page of ACL 2023 paper "Evaluating Paraphrastic Robustness in Textual Entailment Models"

Please summarize your research. Why are your results notable?

Shreyashee: The task of Recognizing Textual Entailment (RTE) involves predicting whether one sentence (the hypothesis) can be inferred from another (the premise). It plays a crucial role in natural language understanding (NLU).

To assess the reliability and robustness of RTE models in handling paraphrases, we introduce a test set called PaRTE. This set consists of examples from previous challenges, rewritten using a lexical rewriter (T5) while preserving the original meaning and label (entails or does not) of the sentence pairs. By evaluating how models respond to paraphrased examples, we can determine if their predictions remain consistent.

While this test alone cannot gauge the full language comprehension of RTE models, it is essential for any NLU system to be able to handle paraphrases. Our experiments demonstrate that contemporary models generally maintain consistent predictions when presented with paraphrased examples. However, our analysis reveals that models are more likely to change their predictions when both the premise and hypothesis undergo rewriting, as opposed to when only one of the sentences is modified.
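As a concrete illustration of this evaluation protocol, here is a minimal sketch (not the authors’ released code) of how prediction consistency under paraphrasing can be measured; the model wrapper, label strings, and example pairs are assumptions for illustration only.

```python
# Minimal sketch of a paraphrase-consistency check for an RTE model.
# `model_predict` is a placeholder for any entailment classifier that maps a
# (premise, hypothesis) pair to a label such as "entailment" / "not_entailment".

def consistency_rate(model_predict, original_pairs, paraphrased_pairs):
    """Fraction of examples whose predicted label is unchanged after paraphrasing."""
    assert len(original_pairs) == len(paraphrased_pairs)
    unchanged = 0
    for (prem, hyp), (p_prem, p_hyp) in zip(original_pairs, paraphrased_pairs):
        if model_predict(prem, hyp) == model_predict(p_prem, p_hyp):
            unchanged += 1
    return unchanged / len(original_pairs)

# Toy usage with a dummy predictor that always answers "entailment":
if __name__ == "__main__":
    toy_predict = lambda premise, hypothesis: "entailment"
    orig = [("A man is sleeping.", "A person is asleep.")]
    para = [("A man is napping.", "Someone is asleep.")]
    print(consistency_rate(toy_predict, orig, para))  # 1.0 for the toy predictor
```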

We provide PaRTE to encourage others to evaluate the performance of their models with paraphrased RTE examples. Transformer-based models like RoBERTa exhibit higher robustness, while Bag-of-Words (BoW) and BiLSTM models show lower robustness. The table below from the paper demonstrates these results.

Table 2: Each row represents a model. The columns MNLI, RTE, and ˆPaRTE report the model’s accuracy on those test sets. The last column (% Δ ˆPaRTE) reports the percentage of examples where the model changed its prediction.

How does your research advance the state-of-the-art in the field of natural language processing?

This research identifies the strengths and weaknesses of different models in handling paraphrases and opens a wide range of possible future work in the field. For instance, investigating the reasons behind the sensitivity and lack of robustness of Bag-of-Words (BoW) and BiLSTM models to paraphrasing and exploring alternative approaches or modifications to these models. Researchers could conduct experiments to understand the factors that contribute to the differences in robustness among transformer-based models such as RoBERTa, BERT, and GPT-3. This would involve exploring variations in model architectures, training strategies, or finetuning techniques to enhance their ability to handle paraphrases effectively.

Investigating the generalizability of the findings to other tasks related to NLU beyond RTE is another research prospect. Assessing the robustness of NLU models to paraphrasing in tasks like sentiment analysis, text classification, or machine translation can provide a comprehensive understanding of the impact of paraphrasing on different NLU domains.


InfoSync: Information Synchronization across Multilingual Semi-structured Tables (Findings of the ACL 2023)
Siddharth Hemant Khincha (IIT Guwahati), Chelsi Jain (CTAE, Udaipur), Vivek Gupta (University of Utah), Tushar Kataria (University of Utah), Shuo Zhang (Bloomberg)

Virtual Poster Session 1: Resources and Evaluation | Monday, July 10, 11:00 AM-12:30 PM EDT
Spotlight Session: Resources and Evaluation | Monday, July 10, 7:00-9:00 PM EDT
The First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023), Oral Presentations (OP2) | Thursday, July 13, 2023, 1:50–2:40 PM EDT

Front page of Findings of the ACL 2023 paper "InfoSync: Information Synchronization across Multilingual Semi-structured Tables"

Please summarize your research. Why are your results notable?

Shuo: English articles on the web about specific subjects tend to be updated more frequently than those in other languages. However, cultural differences, topic preferences, and editing inconsistencies can lead to information mismatches across multilingual data, such as outdated or missing information. Online encyclopedias like Wikipedia contain millions of articles that require constant updates, which involves expanding existing articles, modifying content, correcting facts in sentences, and altering categories. Despite this, more than 40% of Wikipedia’s active editors primarily work in English, even though only 15% of the global population speaks English as their first language. As a result, information in articles written in non-English languages may not be as up-to-date. This work focuses on the need to synchronize information across multilingual content, as illustrated in the below example (Figure B) of mismatched information for the same entity in different languages.

Figure B. Janaki Ammal Infoboxes in English (right) and Hindi (left). The Hindi Infobox table lacks the "British Rule of India" as a cultural context. Two value mismatches: (a) the Hindi Infobox table doesn't list the state in which she died, and (b) the Institution values differ. The Hindi table mentions "residence," while the English table doesn't. The Hindi Infobox table is also missing Thesis, Awards, and Alma Mater keys. Neither mentions parents, early education, or honors.

To systematically address this challenge, we’ve curated a dataset called InfoSync, which consists of 100,000 multilingual Infobox tables across 14 languages, covering 21 Wikipedia categories. Approximately 3,500 table pairs of English to non-English or non-English to non-English are sampled and manually synchronized. We propose a two-step approach (alignment and update) that demonstrates superior performance compared to existing baselines. In addition, the rule-based update system achieves excellent acceptance rates when used for human-assisted Wikipedia editing.

How does your research advance the state-of-the-art in the field of natural language processing?

We formally present the task of Information Synchronization for multilingual articles, encompassing paragraphs, tables, lists, categories, and images. However, due to the immense complexity of synchronizing all types of information across various modalities on a web page, we focus on semi-structured data, specifically table synchronization in a few languages, as an initial step towards our goal.

We propose a table synchronization method consisting of two steps: (1) Information Alignment, which involves aligning table rows across languages, and (2) Information Update, which updates missing or outdated rows across language pairs to address inconsistencies. The information alignment component aims to align rows in multilingual tables using corpus statistics from Wikipedia, such as key and value-based similarities. The information update step employs an efficient rule-based approach, with nine manually-curated rules: row transfer, time-based, value trends (positive and negative), multi-key matching, append value, high to low resource, number of row differences, and rare keys.
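As a rough illustration of what one such update rule can look like, the sketch below implements a simplified "row transfer"-style rule over Infobox tables represented as key-value dictionaries. The data format, alignment dictionary, and example values are assumptions for illustration, not the paper’s implementation.

```python
# Simplified sketch of a "row transfer" style update rule: copy rows that exist
# in the source-language table but have no aligned counterpart in the target table.
# Tables are modeled as {key: value} dicts; `aligned_keys` stands in for the
# corpus-statistics-based alignment step described above.

def row_transfer(source_table: dict, target_table: dict, aligned_keys: dict) -> dict:
    """Return an updated copy of target_table with missing rows transferred."""
    updated = dict(target_table)
    for src_key, src_value in source_table.items():
        tgt_key = aligned_keys.get(src_key)      # None means no aligned row exists
        if tgt_key is None or tgt_key not in updated:
            # In a real system the value would also be translated/localized.
            updated[src_key] = src_value
    return updated

en_table = {"Born": "4 November 1897", "Alma mater": "University of Michigan"}
hi_table = {"जन्म": "4 नवम्बर 1897"}
alignment = {"Born": "जन्म"}          # "Alma mater" has no aligned Hindi row
print(row_transfer(en_table, hi_table, alignment))
```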

We evaluate both tasks on the InfoSync dataset to demonstrate their effectiveness. We find that the developed resources can significantly boost research in the area of information synchronization, particularly when using Wikipedia tables for applications like knowledge graph construction.

MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
Shiyue Zhang (UNC Chapel Hill), Shijie Wu (Bloomberg), Ozan İrsoy (Bloomberg), Steven Lu (Bloomberg), Mohit Bansal (UNC Chapel Hill), Mark Dredze (Bloomberg/Johns Hopkins University), David Rosenberg (Bloomberg)

Poster Session 2: Large Language Models | Monday, July 10, 2:00-3:30 PM EDT

Front page of ACL 2023 paper "MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies"

Please summarize your research. Why are your results notable?

Shiyue: Although human text usually has low perplexity under strong language models (LMs), random sampling from these models still results in non-human-like text. We believe these LMs are overgeneralized, in the sense that they have larger support than the human language distribution, and we think the standard maximum likelihood estimation (MLE) learning objective contributes to this problem. MLE encourages the model to cover every single example in the training set. Since data noise is inevitable, MLE makes the model put non-trivial probability mass on non-human-like text.

To tackle this problem, we propose MixCE, augmenting MLE – also known as forward cross-entropy – with the reverse cross-entropy (CE) of the data distribution P relative to the model distribution Q. Reverse CE reflects how humans evaluate model-generated text and can effectively discourage the model from focusing on correcting instances it does poorly on, which are often noisy or rare examples.
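Concretely, writing the cross-entropy of distribution B relative to distribution A as CE(A, B), the mixed objective interpolates the two directions; the mixing weight η and this particular notation are ours, for illustration:

\[
\mathcal{L}_{\mathrm{MixCE}}(\theta) \;=\; \eta\,\mathrm{CE}\!\left(P,\, Q_\theta\right) \;+\; (1-\eta)\,\mathrm{CE}\!\left(Q_\theta,\, P\right),
\qquad
\mathrm{CE}(A, B) \;=\; -\,\mathbb{E}_{x \sim A}\!\left[\log B(x)\right]
\]

Here the forward term CE(P, Qθ) is the usual MLE objective, while the reverse term CE(Qθ, P) penalizes the model for placing probability mass where the data distribution does not.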

Figure 1: MIXCE combines two complementary driving forces: reverse CE helps narrow the model distribution Qθ down when it is broader than the data distribution P, while forward CE helps broaden Qθ out when it is narrower than P.

Optimizing reverse CE is intractable because P is unknown. We propose an approximation that ends up being a self-reinforced loss that encourages the model to produce text in which it is already confident. Since MLE-pretrained GPT-2 assigns higher probabilities to human text than to text sampled from the model, self-reinforcing effectively “pushes” the model distribution toward the human distribution. The final approximate MixCE loss is easy to implement – and is as fast as MLE.
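The sketch below shows one plausible way to read this idea in PyTorch: each token’s negative log-likelihood is reweighted by the model’s own (detached) probability for that token, and the result is mixed with standard MLE. This is an illustrative approximation rather than the paper’s exact objective; the function name, mixing weight, and detaching choice are assumptions.

```python
import torch
import torch.nn.functional as F

def mixce_style_loss(logits, targets, eta=0.5):
    """Mix standard token-level cross-entropy with a self-reinforced variant.

    logits:  (batch, seq_len, vocab) unnormalized scores from the LM
    targets: (batch, seq_len) gold token ids
    eta:     weight on the standard (forward) cross-entropy term
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)

    forward_ce = -token_logp.mean()

    # Self-reinforced term: weight each token's NLL by the model's own
    # probability of that token (detached so the weights themselves are not trained).
    weights = token_logp.detach().exp()
    self_reinforced = -(weights * token_logp).mean()

    return eta * forward_ce + (1.0 - eta) * self_reinforced

# Toy usage with random tensors:
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
print(mixce_style_loss(logits, targets).item())
```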

Figure 2: The histograms of sequence-level and token-level negative log-likelihoods of human texts and model generations from GPT-2 large.

We demonstrate the effectiveness of MixCE in both synthetic and real-data settings. By finetuning GPT-2 in three text domains, we show that, compared to MLE, random sampling from MixCE-trained models produces text that scores better under automatic metrics (e.g., diversity, coherence, and mauve), and it is preferred by humans. When using top-p sampling and carefully tuning the probability threshold, MixCE performs on par with MLE. Nonetheless, the optimal threshold of MixCE is closer to 1, implying a less noisy model distribution.

How does your research advance the state-of-the-art in the field of natural language processing?

The state-of-the-art large-scale pretrained LMs are usually trained by MLE. People observe that larger and cleaner data leads to better performance and, no matter which LM is used, the decoding method needs to be chosen wisely to get the desired behavior. These are consistent with our overgeneralization hypothesis for MLE. In this work, we propose an alternative learning objective to MLE from first principles. We believe that our MixCE objective can alleviate the overgeneralization problem, especially when the training data is limited or noisy.

On “Scientific Debt” in NLP: A Case for More Rigour in Language Model Pre-Training Research
Made Nindyatama Nityasya (Independent Researcher), Haryo Akbarianto Wibowo (Independent Researcher), Alham Fikri Aji (MBZUAI), Genta Indra Winata (Bloomberg), Radityo Eko Prasojo (Universitas Indonesia), Phil Blunsom (Cohere.AI/University of Oxford), Adhiguna Kuncoro (DeepMind)

Oral Session 2: Theme: Reality Check | Monday, July 10, 2:00-3:30 PM EDT

Front page of ACL 2023 paper "On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research"

Please summarize your research. Why are your results notable?

Genta: This evidence-based position paper critiques current research practices within the study of language model pre-training. Despite rapid recent progress afforded by increasingly better pre-trained language models (PLMs), current PLM research practices often conflate different possible sources of model improvement, without conducting proper ablation studies and principled comparisons between different models under comparable conditions. These practices leave us ill-equipped to understand which pre-training approaches should be used under what circumstances, impeding reproducibility and credit assignment.

We provide a case study revisiting the success of BERT over its baselines, ELMo and GPT-1, and demonstrate how these baselines (and even simpler variants thereof) can, in fact, achieve competitive or better performance than BERT (see Table 1). These findings demonstrate how disentangling different factors of model improvement can lead to valuable new insights. We conclude with recommendations for how to encourage and incentivize this line of work in order to accelerate progress towards a better and more systematic understanding of what factors are driving the progress of today’s foundational models.

Table 1: GLUE test results. We use F1 scores for MRPC and QQP, Matthew’s Correlation for CoLA, SpearmanR for STS-B, and accuracy for the rest; all models are pre-trained with the same batch size & compute (1M steps).

How does your research advance the state-of-the-art in the field of natural language processing?

Recently, rapid progress within the PLM literature has led to tremendous advances within the NLP field. Despite this progress, current PLM research practices that change multiple different PLM components at once, often without conducting proper ablation studies or principled comparisons that disentangle the impact of different components, have introduced certain issues that we call “scientific debt.”

Through experiments that disentangle the contribution of BERT’s bidirectional masked LM objective through principled comparison with prior work, we demonstrate how asking “which factors contributed the most to the model performance that we observe today?” can lead to valuable new insights. We offer several recommendations to encourage and incentivize this line of work, which aims to better understand how each factor contributes to the progress of our PLMs today. Doing so will enable researchers to better address the ongoing accumulation of scientific debt within current PLM research literature.

NusaCrowd: Open Source Initiative for Indonesian NLP Resources (Findings of the ACL 2023)
Samuel Cahyawijaya (HKUST/INACL), Holy Lovenia (HKUST/INACL), Alham Fikri Aji (MBZUAI), Genta Indra Winata (Bloomberg), Bryan Wilie (HKUST/INACL), Fajri Koto (MBZUAI/INACL), Rahmad Mahendra (Universitas Indonesia/INACL), Christian Wibisono (Institut Teknologi Bandung), Ade Romadhony (Telkom University/INACL), Karissa Vincentio (JULO/INACL), et al.

Spotlight Session: Resources and Evaluation | Monday, July 10, 7:00-9:00 PM EDT
Virtual Poster Session 7: Resources and Evaluation | Wednesday, July 12, 11:00 AM-12:30 PM EDT
Third Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP) | Friday, July 14, 2023

Front page of Findings of the ACL 2023 paper "NusaCrowd: Open Source Initiative for Indonesian NLP Resources"

Please summarize your research. Why are your results notable?

Genta: We present NusaCrowd, the very first collaborative initiative to collect and unite existing resources for Indonesian languages, which have more than 270 million speakers combined. This also includes opening access to previously non-public resources.

Through this initiative, we have brought together 137 datasets and 117 standardized data loaders (See Figure 3). The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd’s data collection enables the creation of the first zero-shot benchmarks for natural language understanding and natural language generation in Indonesian and its local languages. Furthermore, the NusaCrowd initiative led to the creation of the first multilingual automatic speech recognition benchmark for Indonesian and its local languages. Our work is intended to help advance NLP research in underrepresented languages.

We also evaluated our benchmark by prompting multilingual generative models (i.e., BLOOMZ, XGLM) and compared their performance with encoder-only model finetuning and zero-shot cross-task transfer using XLM-R, as shown in Figure 4.

How does your research advance the state-of-the-art in the field of natural language processing?

In recent decades, research on Indonesian NLP has not focused on providing open source resources to the research community, and most efforts to collect datasets have remained private. Through this initiative, we want to promote the importance of having public datasets, thereby enabling more researchers to accelerate the research and development process and benefit the NLP research community. We see NusaCrowd as a data hub for Indonesian and regional languages, since it facilitates easy access to datasets while also accelerating research and development.

Joint End-to-end Semantic Proto-role Labeling
Elizabeth Spaulding (University of Colorado Boulder), Gary Kazantsev (Bloomberg), Mark Dredze (Bloomberg)

Poster Session 3: Information Extraction | Tuesday, July 11, 9:00-10:30 AM EDT

Front page of ACL 2023 paper "Joint End-to-End Semantic Proto-role Labeling"

Please summarize your research. Why are your results notable?

Elizabeth: We investigated the task of semantic proto-role labeling – also known as SPRL, which assigns fine-grained properties to the relationship between events and the participants in those events – as a part of a larger information extraction pipeline. Let’s consider the sentence “John bought an umbrella.” We want to transform this sentence into a structure that a computer can understand. A semantic proto-role labeling system would tell us that John instigated the buying event, and that the umbrella changed owners as a result of the buying event. We developed a system that jointly identifies “bought” as the predicate (action) of the sentence and “John” and “umbrella” as arguments of this action.

Previous papers have mostly considered systems that make independent decisions for this task in isolation, and none have analyzed the problem in depth as part of a joint predicate-argument extraction pipeline. We extend an architecture from previous work to enable joint predicate and argument extraction (Figure 1), and break down our results component-by-component to investigate how errors in predicate and argument identification propagate to proto-role labeling.
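To make the joint formulation concrete, here is a small sketch of how the pipeline’s output for the example sentence could be represented; the class names, property names, and scores are illustrative assumptions, not the paper’s code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ProtoRoleArgument:
    span: Tuple[int, int]                                        # token offsets of the argument
    properties: Dict[str, float] = field(default_factory=dict)   # proto-role property -> likelihood

@dataclass
class PredicateAnalysis:
    predicate_span: Tuple[int, int]
    is_predicate: bool
    arguments: List[ProtoRoleArgument] = field(default_factory=list)

# "John bought an umbrella."  (tokens: 0=John, 1=bought, 2=an, 3=umbrella)
analysis = PredicateAnalysis(
    predicate_span=(1, 1),
    is_predicate=True,
    arguments=[
        ProtoRoleArgument(span=(0, 0), properties={"instigation": 0.97, "volition": 0.95}),
        ProtoRoleArgument(span=(2, 3), properties={"change_of_possession": 0.91}),
    ],
)
print(analysis)
```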

Figure 1: Architecture of our end-to-end semantic proto-role labeling system. Given a sentence and a candidate predicate, the model jointly learns to output whether or not the candidate is a predicate, the arguments of the candidate predicate, and the proto-role properties for each argument.

How does your research advance the state-of-the-art in the field of natural language processing?

This type of computational analysis is an important element in understanding the type of text we process every day at Bloomberg. Our paper provides a roadmap for focused improvements to semantic proto-role labeling systems, especially in the context of a larger information extraction system.

First, our model demonstrates the efficacy of SPRL when combined with a predicate-argument extraction pipeline. We are competitive with both span-based and dependency-based models and find that joint identification of predicates and arguments still produces a high-performing SPRL system. We also demonstrate that arguments that are difficult for the argument extraction component of the model are also difficult for the proto-role component of the model, suggesting that focusing on challenging arguments in future systems will improve proto-role labeling scores.

We also show that annotator disagreement, as well as the choice to collapse certain classes together in previous work, suggests that future work may benefit from handling the annotations differently from what is currently standard. Finally, we introduce a scoring system for SPRL which enables analysis of the proto-role labeler in conjunction with predicate and argument extraction by providing separate SPRL scores that incur different penalties when errors are made earlier in the pipeline.

RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue
Zhengliang Shi (Shandong University), Weiwei Sun (Shandong University), Shuo Zhang (Bloomberg), Zhen Zhang (Shandong University), Pengjie Ren (Shandong University), Zhaochun Ren (Shandong University)

Poster Session 6: Dialogue and Interactive Systems | Wednesday, July 12 9:00-10:30 AM EDT

Front page of ACL 2023 paper "RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue"

Please summarize your research. Why are your results notable?

Shuo: Open-domain dialogue systems, which are focused on casual, non-goal-oriented conversations, can engage in discussions on a wide variety of subjects. In recent years, we’ve seen rapid advancements in natural language generation, leading to significant progress in the development of these dialogue systems. Conversations with such systems are similar to human-to-human interactions, as multiple responses may be suitable for a given context, especially when users don’t have a specific goal other than to enjoy the conversation. This makes evaluating these conversations difficult due to the one-to-many problem, as illustrated in the image (Figure A) below.

Figure A. An example illustrating the one-to-many nature of open-domain dialogues.

To tackle the one-to-many problem, we introduce a new method called Reference-Assisted Dialogue Evaluation (RADE). RADE uses a pre-created response as a reference instead of a golden standard. To facilitate RADE, we’ve designed a new human annotation task to expand existing datasets, which incorporates metric decomposition and pairwise annotation. In this task, a pre-scored golden response is paired with generated responses that are rated using a unified rating score. The final scores are calculated by aggregating ratings with a weighted sum from different sub-metrics.
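The aggregation step itself is a simple weighted sum; a minimal sketch (with made-up sub-metric names and weights) looks like this:

```python
# Minimal sketch of aggregating sub-metric ratings into one score via a weighted sum.
# The sub-metric names and weights below are illustrative, not the paper's.

def aggregate_score(ratings: dict, weights: dict) -> float:
    """Weighted sum of per-sub-metric ratings, normalized by the total weight."""
    total_weight = sum(weights.values())
    return sum(weights[m] * ratings[m] for m in weights) / total_weight

ratings = {"relevance": 4.0, "engagingness": 3.5, "fluency": 5.0}
weights = {"relevance": 0.5, "engagingness": 0.3, "fluency": 0.2}
print(aggregate_score(ratings, weights))   # 4.05
```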

The human annotation process gathers labels for three high-quality datasets with 10,112 dialogues, corresponding to three open-domain dialogue system tasks: chitchat, empathetic dialogue, and personal chat. These multi-domain datasets enable RADE to be more robust when applied to cross-domain evaluation scenarios, while also maintaining improved task-specific performance.

How does your research advance the state-of-the-art in the field of natural language processing?

Dialogue evaluation typically relies on reference-based metrics, where a generated response is compared to a pre-created response, often called the golden standard. These metrics measure the similarity between generated and gold responses at either the lexical level (such as ROUGE or BLEU) or the semantic level (like BERTScore or ADEM). However, they don’t account for the one-to-many nature of open-domain dialogues, which limits their consistency with human evaluations.

Recently, two approaches have emerged to address the limitations of reference-based metrics: multi-reference methods and reference-free methods. Multi-reference methods involve annotating multiple references for a dialogue, while reference-free methods discard the golden response in evaluations, thereby achieving higher correlations with human judgments. Despite their advantages, both approaches have drawbacks. Multi-reference methods can be expensive and challenging to apply to various datasets, whereas reference-free methods can be unstable and susceptible to data-induced biases.

By addressing these limitations, the methods proposed in this paper display substantial correlations with human evaluations in comprehensive experiments conducted on three benchmark datasets. The findings confirm the effectiveness, robustness, and broad applicability of the suggested approaches.

Don’t Retrain, Just Rewrite: Countering Adversarial Perturbations by Rewriting Text
Ashim Gupta (University of Utah), Carter Blum (Bloomberg), Temma Choji (Bloomberg), Yingjie Fei (Bloomberg), Shalin Shah (Bloomberg), Alakananda Vempala (Bloomberg), Vivek Srikumar (University of Utah)

Virtual Poster Session 7: Interpretability and Analysis of Models for NLP | Wednesday, July 12 11:00 AM-12:30 PM EDT

Front page of ACL 2023 paper "Don't Retrain, Just Rewrite: Countering Adversarial Perturbations by Rewriting Text"

Please summarize your research. Why are your results notable?

Alakananda: Neural models in NLP have been shown to be vulnerable to strategically modified samples, often referred to as adversarial examples. Generating input text with small, imperceptible perturbations that can fool a model into giving false predictions, i.e., an adversarial attack, can happen both during training and at deployment. Defending against such attacks is important because it ensures the integrity and reliability of NLP systems. For example, if undefended, an attacker could manipulate a phishing email to help it evade a spam detector.

In this work, we present Adversarial Text Interceptor and Rewriter (ATINTER), a novel encoder-decoder model that intercepts and learns to rewrite adversarial inputs to make them non-adversarial for a downstream text classifier. We demonstrate the effectiveness of ATINTER using a T5 model as the general-purpose text rewriter, but our method is applicable to any transformer-based text generator/rewriter. The rewriter module serves as a pluggable component, enabling it to defend models that it was not explicitly trained to protect. Figure 1 below demonstrates this scenario.

Figure 1: Modular application of ATINTER, which is trained for defending a BERT classifier for the SST-2 dataset. Demonstrating transferability across models, ATINTER successfully defends a RoBERTa classifier on SST-2 without retraining. Similarly, ATINTER is successful in defending a BERT model for a news classification task on AGNews.
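As a rough sketch of how such a pluggable rewriter might sit in front of an off-the-shelf classifier using Hugging Face pipelines (the rewriter checkpoint name below is a hypothetical placeholder, not the released ATINTER model):

```python
from transformers import pipeline

# Hypothetical fine-tuned T5 rewriter checkpoint; stands in for a model trained
# to map adversarially perturbed text back to clean text.
rewriter = pipeline("text2text-generation", model="your-org/t5-adversarial-rewriter")

# Any downstream classifier can be placed behind the rewriter without retraining it.
classifier = pipeline("text-classification", model="textattack/bert-base-uncased-SST-2")

def robust_classify(text: str):
    """Rewrite the (possibly perturbed) input first, then classify the rewrite."""
    cleaned = rewriter(text, max_new_tokens=64)[0]["generated_text"]
    return classifier(cleaned)[0]

print(robust_classify("this fiml was an absolutly wonderful surprlse"))
```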

How does your research advance the state-of-the-art in the field of natural language processing?

Through extensive experimentation and comparison with existing methods, we show that ATINTER effectively removes adversarial perturbations, while consistently outperforming other defense approaches on several datasets for text classification. We experimented on four datasets, applying five attack mechanisms (Table 6 from the paper presented below) on the inputs to a downstream BERT base classifier. The results (Table 9 from the paper presented below) reveal that the ATINTER module is effective at providing better adversarial robustness (higher adversarial accuracy) than existing defense approaches, without compromising clean accuracy (i.e., the accuracy of the model on clean non-adversarial inputs).

Table 6: Summary of the black-box adversarial attacks: Comparing the adversarial attacks we use in this work along with related information such as attack type, human perceptibility, and an example input for each attack. The third column indicates whether a human can easily identify if textual input was modified or not based on grammar, syntax, semantics, and other language rules.
Table 9: Summary of the main results. Absolute percentage change in Clean Accuracy and Adversarial Accuracies averaged over the five adversarial attacks.

When used as a pluggable component, ATINTER exhibits good transferability to new models and datasets without the need for retraining. Specifically, we find that ATINTER (a T5-based rewriter), when trained to remove adversarial perturbations for a BERT sentiment classifier on the SST-2 dataset, also removes adversarial perturbations at inference time for a news classification model on AGNews, increasing adversarial robustness by more than 10%.

Multi-lingual and Multi-cultural Figurative Language Understanding (Findings of the ACL 2023)

Anubha Kabra (Carnegie Mellon University), Emmy Liu (Carnegie Mellon University), Simran Khanuja (Carnegie Mellon University), Alham Fikri Aji (MBZUAI), Genta Indra Winata (Bloomberg), Samuel Cahyawijaya (HKUST), Anuoluwapo Aremu (Masakhane), Perez Ogayo (Carnegie Mellon University), Graham Neubig (Carnegie Mellon University)

The 17th Workshop on Linguistic Annotation (LAW) | Poster Session 2, Thursday, July 13, 3:30-5:00 PM EDT

Front page of Findings of the ACL 2023 paper "Multi-lingual and Multi-cultural Figurative Language Understanding"

Please summarize your research. Why are your results notable?

Genta: Figurative language, such as metaphors, permeates human communication in different languages, but it is relatively understudied in NLP, especially in languages with a specific cultural context. Most datasets have been created in English to accelerate progress towards measuring and improving figurative language processing in language models (LMs). However, the use of figurative language is deeply cultural and tied to societal experience, which makes such resources difficult to apply universally.

In this paper, we build a figurative language inference dataset, MABL, for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili, and Yoruba. Our dataset shows that each language relies heavily on cultural and regional concepts for figurative expressions, with the highest overlap between languages that originate in the same region. Table 1 shows examples of figurative language in different languages.

We assess the abilities of multilingual LMs (i.e., LaBSE) and visualize the representations of the figurative language texts (see Figure 1). We can see that texts on the same topic are mapped to nearby regions of the embedding space. We further evaluate figurative language interpretation in zero-shot and few-shot settings. All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and finetuning data. These findings emphasize the need for LMs to be exposed to a broader range of linguistic and cultural variation during training.

Table 1: Examples of figurative expressions and respective inferences from the collected data. Correct answers are highlighted in green.
Figure 1: UMAP visualization of the collected data. Sentence embeddings are obtained using LaBSE (Feng et al., 2020), a multilingual dual encoder model, optimized for cross-lingual retrieval. Refer to Section 4 for more details.
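For readers who want to build a visualization in this spirit, a minimal sketch using the sentence-transformers release of LaBSE together with umap-learn might look like the following; the example sentences are made up and this is not the paper’s plotting code.

```python
from sentence_transformers import SentenceTransformer
import umap

# LaBSE: multilingual dual-encoder sentence embeddings.
model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = [
    "His words were as sharp as a knife.",      # English
    "Kata-katanya setajam pisau.",              # Indonesian
    "Uski baatein chaaku jaisi tez thi.",       # Hindi (romanized)
]
embeddings = model.encode(sentences)

# Project to 2D for plotting; n_neighbors must be smaller than the number of points.
coords = umap.UMAP(n_neighbors=2, random_state=0).fit_transform(embeddings)
print(coords.shape)   # (3, 2)
```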

How does your research advance the state-of-the-art in the field of natural language processing?

This dataset plays a big role in extending our understanding of figurative language in non-English languages. We find considerable variation in figurative language use across languages, particularly related to the unique objects that people invoke in their comparisons, finding significant differences in references to food, mythology, religion, famous figures, and events. This variation is likely due to cultural differences between the countries in which these languages are spoken.

We find that multilingual models have considerable room for improvement on this task, and that these cross-cultural shifts may play a significant role in the performance degradation from English. We encourage the NLP community to further examine the role that culture plays in language, and note that figurative language can be used as a testbed to examine cross-linguistic and cross-cultural variations.


The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges (Findings of the ACL 2023)

Genta Indra Winata (Bloomberg), Alham Fikri Aji (MBZUAI), Zheng Xin Yong (Brown University), Thamar Solorio (Bloomberg)

Front page of Findings of the ACL 2023 paper "The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges"

Please summarize your research. Why are your results notable?

Genta: Code-switching is a common phenomenon in written text and conversations, in which speakers alternate between languages within a conversation. It has been studied for decades by the NLP research community. Initially, code-switching was intensively explored by leveraging linguistic theories. Currently, more machine learning-oriented approaches are used to develop models.

We introduce a comprehensive systematic survey on code-switching research in NLP to understand the progress of the past decades and to conceptualize the challenges and tasks on the topic of code-switching. Finally, we summarize the trends and findings and conclude with a discussion of future directions and open questions for further investigation.

We collected more than 400 papers published on open repositories, such as the ACL Anthology and ISCA Proceedings, and then manually coded these papers to collect coarse and fine-grained information on code-switching research in NLP, including languages covered, NLP tasks explored, and new and emerging trends. Figure 1 shows the number of publications under study over the years. We further categorized the code-switching languages to get a fine-grained classification (see Figure 2).

In addition, motivated by the fact that fields like linguistics, socio-linguistics, and related fields have studied code-switching since the early 1900s, we also investigate to what extent theoretical frameworks from these fields have influenced NLP approaches, and how the choice of methods has evolved over time. Finally, we discuss the most pressing research challenges and identify a path forward to continue advancing this exciting line of work. The area of NLP for code-switching data is thriving, covering an increasing number of language combinations and tasks, and it is clearly advancing from a niche field to a common research topic, thus making our comprehensive survey timely.

How does your research advance the state-of-the-art in the field of natural language processing?

This is the first-ever systematic survey on code-switching research in NLP – including speech processing – to explore the progress of the past decades and understand the existing challenges and tasks in the literature. We found interesting trends in terms of how researchers have moved from using linguistic theories to more contemporary approaches, such as statistical and deep learning techniques. Many works also suggest that code-switching is not limited to two languages, but can involve three or more. We expect this survey will provide valuable information to researchers new to the field, while also motivating additional work from researchers already engaged in NLP for code-switching data. We also hope this survey can provide guidance on how to start working on code-switching research.

Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning (Findings of the ACL 2023)

Genta Indra Winata (Bloomberg), Jane Xie (Bloomberg), Karthik Radhakrishnan (Bloomberg), Shijie Wu (Bloomberg), Xisen Jin (University of Southern California), Pengxiang Cheng (Bloomberg), Mayank Kulkarni (Amazon Alexa AI), Daniel Preoţiuc-Pietro (Bloomberg)

Front page of Findings of the ACL 2023 paper "Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning"

Please summarize your research. Why are your results notable?

Genta: Real-life multilingual systems should be able to efficiently incorporate new languages as the data distributions fed to the system evolve and shift over time (see Figure 1 for the setup). To do this, systems need to handle the issue of catastrophic forgetting, where model performance drops for languages or tasks seen further in the past.

In this paper, we study catastrophic forgetting, as well as methods to minimize it, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks on the WikiAnn and MASSIVE datasets. We present LR ADJUST, a learning rate scheduling method that is simple, yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.

Figure 1. Continual Learning Pipeline
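The exact LR ADJUST schedule is described in the paper; as a loose illustration of the general idea of scaling the learning rate down as new languages arrive, a sketch might look like this (the decay factor and floor are made-up values, not the paper’s settings):

```python
# Illustrative sketch: shrink the learning rate each time training moves on to a
# new language, so later languages overwrite earlier knowledge less aggressively.
# The decay factor and minimum learning rate are made-up values for illustration.

def adjusted_lr(base_lr: float, language_index: int, decay: float = 0.5,
                min_lr: float = 1e-6) -> float:
    """Learning rate to use for the language at position `language_index` (0-based)."""
    return max(base_lr * (decay ** language_index), min_lr)

languages = ["en", "de", "sw", "id"]
for i, lang in enumerate(languages):
    print(lang, adjusted_lr(3e-5, i))
```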

We start by quantifying the extent to which forgetting happens when languages are presented to the model in sequence, identifying an F1 drop of up to 16% compared to training on all of the data mixed together. We then apply our proposed method on top of existing continual learning methods, since the two are orthogonal. Across three different continual learning methods, we find that LR ADJUST helps further reduce the gap between a fully trained model and the continual learning setup.

We analyze cross-lingual transfer in both the backward and forward directions to measure, respectively, the influence of continual learning on previously seen languages and the model’s zero-shot ability on new ones. Finally, we analyze the effects of catastrophic forgetting when first training on multiple languages jointly and when using a curriculum learning approach informed by language similarity.

Figure 2. Average F1 scores and standard deviations over 5 runs on WikiAnn.

How does your research advance the state-of-the-art in the field of natural language processing?

The research shows the critical importance of having a good strategy to address catastrophic forgetting when continually training a multilingual model. Our proposed method, LR ADJUST, enables us to improve the knowledge preservation from the learned tasks and lets the model keep learning in a life-long setting. Moreover, LR ADJUST is also orthogonal to existing continual learning techniques, showing the adaptability and scalability of the approach, which can be easily applied to other use cases, not limited to the multilingual setting. We hope this method can be helpful for further continual learning research.