Bloomberg’s AI Researchers Publish 4 Research Papers at EMNLP 2024
November 11, 2024
At the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024) in Miami, researchers from Bloomberg’s AI Strategy & Research and Quant Research teams in the Office of the CTO, as well as its AI Engineering group, are showcasing their expertise in natural language processing (NLP) by publishing two papers at the main conference and two more at BlackboxNLP 2024, a workshop co-located with EMNLP. Through these papers, the authors and their research collaborators highlight a variety of use cases and applications, novel approaches and improved models for key tasks, and other advances to the state-of-the-art in the field of NLP.
We asked some of our authors to summarize their research and explain why their results were notable:
Academics Can Contribute to Domain-Specialized Language Models
Mark Dredze (Bloomberg/Johns Hopkins University), Genta Indra Winata (Capital One), Prabhanjan Kambadur (Bloomberg), Shijie Wu (Anthropic), Ozan İrsoy (Bloomberg), Steven Lu (Bloomberg), Vadim Dabravolski (Bloomberg), David S Rosenberg (Bloomberg), Sebastian Gehrmann (Bloomberg)
Poster Session B – Language Modeling 2 (Tuesday, November 12 @ 2:00 PM–3:30 PM EST)


Please summarize your research. Why are your results notable?
Sebastian: Over the last few years, the field of NLP has undergone a transformation. While researchers have traditionally specialized in specific tasks like parsing or translation, the underlying technology has unified the modeling approaches used to tackle them. Scaling up language models has reached the point where commercially available models dominate academic leaderboards. This has concentrated research around the creation and adaptation of general-purpose models to improve leaderboard standings. This focus on large general-purpose models excludes many academics and draws attention away from areas where they can make important contributions. As a result, there has been an ongoing debate about the role academia can play in a research environment largely dominated by industry.
With the goal of pointing out research contributions that do not depend on unattainable hardware, we discuss how general-purpose models often underperform in highly specialized domains like financial services or healthcare. Domain-specific systems may make use of general-purpose models as a key ingredient, but they require interdisciplinary collaboration to develop components that help verify the accuracy of responses, ensure attribution to the correct documents, and evaluate their performance in a robust manner. Academics are uniquely positioned for these interdisciplinary collaborations, and we thus advocate for a renewed focus on developing and evaluating domain-specific models.
How does your research advance the state-of-the-art in the field of natural language processing?
Our opinion paper aims to share insights from an industry perspective into how academic work can advance language technology. As industry researchers, we stand in both worlds and want to encourage a healthy and balanced exchange of ideas.
Do LLMs Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with LLMs [OUTSTANDING PAPER AWARD]
Alexander Spangher (USC), Nanyun Peng (UCLA), Sebastian Gehrmann (Bloomberg), Mark Dredze (Bloomberg)
Oral Presentations – Human-centered NLP 1 (Tuesday, November 12 @ 11:00 AM–12:30 PM EST)


Please summarize your research. Why are your results notable?
Mark: When writing news articles, journalists use their creativity to explore different “angles” (i.e., the specific perspectives a reporter takes) for their stories. As large language models (LLMs) become more capable, they may be able to aid journalists in exploring and developing angles for coverage. LLMs can support planning decisions for articles, enabling reporters to explore topics in greater depth and from different perspectives. We want to both evaluate the ability of LLMs in this setting and ensure alignment with human values.


Reporters often write summaries of press releases that include background context and analysis of new information. To determine how LLMs could support writing these types of articles, our paper presents an analysis of a large corpus of these articles. We developed methods to identify news articles that challenge and contextualize press releases, and used these methods to assemble a dataset of 250k press releases and 650k articles covering them. We then used LLMs to generate alternative angles for the same press releases and compared them with the decisions made by human journalists.
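For readers who want a concrete picture of the angle-generation step, here is a minimal sketch. It is not the paper’s pipeline: the `openai` client, the model name, and the prompt wording are all illustrative assumptions.

```python
# Illustrative sketch of generating alternative coverage angles with an LLM.
# The client library, model name, and prompt are placeholder assumptions,
# not the setup used in the paper.
from openai import OpenAI

client = OpenAI()

def suggest_angles(press_release: str, n: int = 5) -> list[str]:
    """Ask an LLM to propose possible coverage angles for a press release."""
    prompt = (
        "You are a newsroom editor. Read the press release below and propose "
        f"{n} distinct angles a journalist could take when covering it. "
        "Return one angle per line.\n\n" + press_release
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

# The suggested angles can then be compared with the angle the human-written
# article actually took, e.g., via annotation or embedding similarity.
```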


Our findings show that human-written news articles that challenge and contextualize press releases take more creative angles and use more informational sources. However, both the angles and the sources that LLMs suggest are significantly less creative than those chosen by human journalists. Nevertheless, LLMs align better with humans when recommending angles than when recommending informational sources.
How does your research advance the state-of-the-art in the field of natural language processing?
Sebastian: Algorithmic journalism has long been used to help journalists in their jobs. With LLMs, the number of tasks that can be (partially) automated has vastly increased, and now includes even some of the more creative tasks. However, as with every assistive tool, a journalist is still responsible for what they are writing, and it is important to understand how tools like LLMs can impact their writing process. Our study sheds light on a core limitation of LLMs for journalism, demonstrating that they fall short on creative tasks and that, as of today, they cannot be relied on as a primary tool to create angles for news stories.
Enhancing Question Answering on Charts Through Effective Pre-training Tasks
Ashim Gupta (University of Utah), Vivek Gupta (University of Pennsylvania), Shuo Zhang (Bloomberg), Yujie He (Bloomberg), Ning Zhang (Bloomberg), Shalin Shah (Bloomberg)
BlackboxNLP 2024 – Poster Sessions (Friday, November 15 @ 11:00 AM-12:30 PM and 3:30-4:30 PM EST)


Please summarize your research. Why are your results notable?
Shalin: We live in a world rich with visual information. While modern AI systems are increasingly multi-modal (i.e., they use visual and textual information), we uncover limitations of current chart-based question answering (QA) systems.
For example, in the image below, we show three different variations of the same chart. Each of these charts displays the same (or similar) data, but with a different order of bars, or with an additional bar.
Our goal in this work is to perform such perturbations to charts and evaluate the chart understanding capabilities of state-of-the-art models.
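To make the perturbations concrete, the sketch below renders three variants of the same toy bar chart: the original, a shuffled bar order, and a version with one extra bar. The data and plotting code are our own illustration, not material from the paper.

```python
# Illustrative sketch: rendering perturbed variants of the same bar chart.
# The data here is invented for the example, not taken from the paper.
import random
import matplotlib.pyplot as plt

data = {"2019": 12, "2020": 18, "2021": 15, "2022": 21}

def plot_bars(items, title, filename):
    labels, values = zip(*items)
    plt.figure()
    plt.bar(labels, values)
    plt.title(title)
    plt.savefig(filename)
    plt.close()

# Original chart.
plot_bars(list(data.items()), "Original", "chart_original.png")

# Perturbation 1: same data, shuffled bar order.
shuffled = list(data.items())
random.shuffle(shuffled)
plot_bars(shuffled, "Shuffled order", "chart_shuffled.png")

# Perturbation 2: same data plus one additional bar.
plot_bars(list(data.items()) + [("2023", 9)], "Extra bar", "chart_extra_bar.png")

# A robust chart-QA model should answer a question such as
# "What was the value in 2020?" identically for all three variants.
```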


To shed light on the shortcomings of current chart-based QA systems, this paper first undertakes a detailed checklist-based behavioral analysis (see table below). Using models trained on the ChartQA dataset as a representative case study, we systematically evaluate model responses against examples constructed from the checklist of expected behaviors, allowing us to pinpoint areas in which these models falter. We focus on two large, recently introduced chart-pretrained models: MatCha and DePlot + LLM. While MatCha is an end-to-end chart-to-text pretrained model, DePlot + LLM is a pipelined approach in which DePlot first converts an input chart into a textual representation in the form of a table, and a large language model (LLM) then performs few-shot question answering over that table.
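As a rough illustration of the pipelined approach, the sketch below uses the publicly released google/deplot checkpoint on Hugging Face to linearize a chart into a table and then builds a question-answering prompt for a downstream LLM. The file name, question, and prompt wording are placeholders, and the generation settings are simplified rather than the paper’s exact configuration.

```python
# Sketch of a DePlot + LLM pipeline (assumes the google/deplot checkpoint on
# Hugging Face; prompts and generation settings are simplified illustrations).
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

# Step 1: chart-to-table. DePlot linearizes the chart into a textual table.
image = Image.open("chart.png")  # placeholder chart image
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
table_ids = model.generate(**inputs, max_new_tokens=512)
table_text = processor.decode(table_ids[0], skip_special_tokens=True)

# Step 2: few-shot question answering over the linearized table.
# The resulting prompt would be sent to an LLM of choice.
question = "Which category has the highest value?"
qa_prompt = (
    "Answer the question using the table.\n\n"
    f"Table:\n{table_text}\n\nQuestion: {question}\nAnswer:"
)
print(qa_prompt)
```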


Based on our systematic analysis, we propose a set of three pre-training tasks to improve the robustness of these models: Visual-Structure prediction, Summary Statistics prediction, and Numerical Operator prediction. Evaluating on three chart question answering datasets (ChartQA, PlotQA, and OpenCQA), we find that models fine-tuned after these additional pre-training tasks outperform the baseline model by more than 1.7 percentage points.
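As one illustration of how such pre-training examples might be constructed, the sketch below derives Summary Statistics prediction pairs from a chart’s underlying data table. The exact task formulation and data construction in the paper may differ.

```python
# Illustrative sketch: building Summary Statistics prediction examples from a
# chart's underlying data table. The paper's exact formulation may differ.
import statistics

def summary_statistics_examples(table: dict[str, float]) -> list[tuple[str, str]]:
    """Turn a label->value table into (question, answer) pre-training pairs."""
    values = list(table.values())
    linearized = " | ".join(f"{k}: {v}" for k, v in table.items())
    return [
        (f"Table: {linearized}\nWhat is the maximum value?", str(max(values))),
        (f"Table: {linearized}\nWhat is the minimum value?", str(min(values))),
        (f"Table: {linearized}\nWhat is the mean value?",
         str(round(statistics.mean(values), 2))),
        (f"Table: {linearized}\nHow many categories are shown?", str(len(values))),
    ]

examples = summary_statistics_examples({"2019": 12, "2020": 18, "2021": 15})
for question, answer in examples:
    print(question, "->", answer)
```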
How does your research advance the state-of-the-art in the field of natural language processing?
Numerous studies have highlighted various robustness challenges in NLP models, including their over-sensitivity to minor perturbations as well as their under-sensitivity to large changes. A prominent method for identifying these issues in text-only models is behavioral analysis using specifically designed checklist examples. In this work, we build upon such approaches by applying them to multimodal models, developing a checklist that helps identify similar robustness concerns, and then demonstrating absolute improvements once those shortcomings are addressed.
Can We Statically Locate Knowledge in Large Language Models? Financial Domain and Toxicity Reduction Case Studies
Jordi Armengol-Estapé (University of Edinburgh), Lingyu Li (Bloomberg), Sebastian Gehrmann (Bloomberg), Achintya Gopal (Bloomberg), David S Rosenberg (Bloomberg), Gideon S. Mann*, Mark Dredze (Bloomberg/Johns Hopkins University)
BlackboxNLP 2024 – Poster Sessions (Friday, November 15 @ 11:00 AM-12:30 PM and 3:30-4:30 PM EST)


Please summarize your research. Why are your results notable?
Lingyu: LLMs contain vast amounts of information distributed across billions of parameters. While these models can produce the correct answer to a wide range of factual questions, little is known about how and where this knowledge is stored within a model’s parameters. If we knew more about how this information was stored, we could better update models as new information becomes available.
Our research introduces a method that can locate knowledge directly in the parameters of an LLM, without the need for forward or backward passes. Instead, our method interprets model parameters in an embedding space that is shared with word representations. We further developed a method to search for specific knowledge and pinpoint the exact model parameters responsible for storing it.
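The core idea of reading feed-forward parameter vectors through the model’s output embedding matrix can be sketched roughly as follows on GPT-2. This is an illustrative approximation of that general approach (in the spirit of “logit lens”-style analyses), not the authors’ exact method; the layer, target token, and ablation step are arbitrary choices for demonstration.

```python
# Rough sketch of reading FFN "value" vectors in the vocabulary embedding space,
# shown on GPT-2 for illustration; the paper's method and models differ in detail.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

unembed = model.lm_head.weight      # (vocab_size, hidden)
layer = model.transformer.h[6]      # arbitrary middle layer
values = layer.mlp.c_proj.weight    # (intermediate, hidden): one row per FFN neuron

with torch.no_grad():
    # Score each FFN value vector against a target token: high scores suggest
    # that parameter vector "writes" that token into the residual stream.
    target = tokenizer.encode(" Tokyo")[0]   # arbitrary example token
    scores = values @ unembed[target]        # (intermediate,)
    top_neurons = torch.topk(scores, k=5).indices.tolist()
print("Layer 6 FFN neurons most associated with ' Tokyo':", top_neurons)

# Illustrative ablation: zero out the identified value vectors, then re-evaluate
# the model on prompts that require the associated knowledge.
with torch.no_grad():
    layer.mlp.c_proj.weight[top_neurons, :] = 0.0
```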
We verify our method by selecting question-answer tasks that require specific knowledge, and then disabling the model parameters that are associated with this knowledge. We assess the method in both the financial domain and for toxicity reduction. Our results show that we can successfully locate knowledge, as evidenced by our ability to disable that knowledge, such as causing a model to “forget” CEO names and stock ticker symbols. Moreover, we find that the model’s knowledge is sufficiently distributed that we can disable specific types of knowledge without affecting others (e.g., the model forgets CEO names but not the capital cities of countries).


We also demonstrate our techniques by identifying and removing toxic language information from a model. By locating and ablating only the 0.25% of parameters that we identify as being related to toxic language, we achieved a nearly 30% reduction in harmful outputs, without degrading the model’s ability to perform generic language tasks.


These results highlight the potential for fine-grained control over an LLM’s knowledge, offering a more interpretable, efficient way to manage large-scale models.
How does your research advance the state-of-the-art in the field of natural language processing?
Our approach gives researchers new insights into the internal workings of LLMs, fosters greater transparency, and provides a foundation for methods that can verify model behaviors. While traditional methods for understanding LLMs rely on running inputs through the model and observing outputs, our static approach allows researchers to identify where knowledge is stored without computationally expensive processes. This method is more efficient and offers new avenues for interpreting models at scale. Moreover, we show promising results in using our method for targeted knowledge removal, which allows for precise removal (or modification) of outdated or harmful information from the LLM.