Publications

Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

V. Lialin, N. Shivagunde, S. Muckatira, A. Rumshisky

Paper link

Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparametrized models remains poorly understood, and alternative approaches do not necessarily make it cheaper to train high-performance models. In this paper, we explore low-rank training techniques as an alternative approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to pre-training transformer language models with up to 350M parameters and demonstrate comparable performance to regular neural network training. Furthermore, we observe that the efficiency of ReLoRA increases with model size, making it a promising approach for training multi-billion-parameter networks efficiently. Our findings shed light on the potential of low-rank training techniques and their implications for scaling laws.
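A minimal sketch of the core merge-and-reset idea behind training with restarted low-rank updates; the module, initialization, and training-loop details below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Frozen full-rank weight plus a trainable low-rank delta."""
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02, requires_grad=False)
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.02)  # trainable down-projection
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # trainable up-projection

    def forward(self, x):
        return x @ (self.weight + self.B @ self.A).T

    @torch.no_grad()
    def merge_and_reset(self):
        """Fold the rank-r update into the frozen weight and reinitialize it.
        Repeating this during training lets a sequence of low-rank updates
        accumulate into a higher-rank total change of the weight matrix."""
        self.weight += self.B @ self.A
        nn.init.normal_(self.A, std=0.02)
        nn.init.zeros_(self.B)
```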

Read More

Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale

V. Deshpande, D. Pechi, S. Thatte, V. Lialin, A. Rumshisky. Accepted to ACL 2023

Paper link

In recent years, language models have drastically grown in size, and the abilities of these models have been shown to improve with scale. The majority of recent scaling-law studies have focused on high-compute, high-parameter-count settings, leaving the question of when these abilities begin to emerge largely unanswered. In this paper, we investigate whether the effects of pre-training can be observed when the problem size is reduced, modeling a smaller, reduced-vocabulary language. We show the benefits of pre-training with the masked language modeling (MLM) objective in models as small as 1.25M parameters, and establish a strong correlation between pre-training perplexity and downstream performance (GLUE benchmark). We examine downscaling effects, extending scaling laws to models as small as ~1M parameters. At this scale, we observe a break of the power law for compute-optimal models and show that the MLM loss does not scale smoothly with compute cost (FLOPs) below 2.2×10^15 FLOPs. We also find that adding layers does not always benefit downstream performance.

Read More

Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

N. Shivagunde, V. Lialin, A. Rumshisky

Paper link

Language model probing is often used to test specific capabilities of these models. However, conclusions from such studies may be limited when the probing benchmarks are small and lack statistical power. In this work, we introduce new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500) inspired by psycholinguistic studies. We dramatically extend the existing NEG-136 and ROLE-88 benchmarks using GPT3, increasing their size from 18 and 44 sentence pairs to 750 each. We also create another version of the extended negation dataset (NEG-1500-SIMP-TEMP) using template-based generation; it consists of 770 sentence pairs. We evaluate 22 models on the extended datasets and see model performance dip 20-57% compared to the original smaller benchmarks. We observe high levels of negation sensitivity in models like BERT and ALBERT, demonstrating that previous findings might have been skewed due to smaller test sets. Finally, we observe that while GPT3 generated all the examples in ROLE-1500, it is only able to solve 24.6% of them during probing.

Read More

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning

V. Lialin, V. Deshpande, A. Rumshisky

Paper link

This paper presents a systematic overview and comparison of parameter-efficient fine-tuning methods covering over 40 papers published between February 2019 and February 2023. These methods aim to resolve the infeasibility and impracticality of fine-tuning large language models by only training a small set of parameters. We provide a taxonomy that covers a broad range of methods and present a detailed method comparison with a specific focus on real-life efficiency and fine-tuning multibillion-scale language models.
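As a hedged illustration of the recipe most of these methods share (freeze the backbone, train only a small number of added parameters), here is a sketch of a bottleneck adapter; the module names and dimensions are assumptions for illustration, not taken from any specific surveyed method:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, hidden_dim, bottleneck_dim=32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)  # start close to an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

def mark_only_adapters_trainable(model: nn.Module):
    """Freeze the backbone and fine-tune only the inserted adapter parameters
    (assumes adapter submodules are registered under names containing 'adapter')."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name.lower()
```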

Read More

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

V. Lialin, S. Rawls, D. Chan, S. Ghosh, A. Rumshisky, W. Hamza; WACV 2023 Workshop on Multimodal Pre-Training

Paper link

Scaling up weakly-supervised datasets has been shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. The currently popular video-text data mining approach via automatic speech recognition (ASR), used in HowTo100M, provides low-quality captions that often do not refer to the video content. Other mining approaches do not provide proper language descriptions (video tags) and are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone. We fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDEr on MSR-VTT) than pre-training on a single modality. Our methods are complementary to the existing pre-training or data mining approaches and can be used in a variety of settings.
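A hedged sketch of the pseudolabeling idea: sample frames from each video and caption them with an off-the-shelf image captioner to obtain video-text training pairs. The `pseudolabel_videos` helper and `caption_image` callable are hypothetical placeholders, not the paper's pipeline:

```python
from typing import Callable, List, Sequence, Tuple

def pseudolabel_videos(
    videos: Sequence[Sequence],              # each video is a sequence of frames
    caption_image: Callable[[object], str],  # any pretrained image captioning model
    frames_per_video: int = 4,
) -> List[Tuple[list, List[str]]]:
    """Build (frames, captions) training pairs without any aligned video-text
    data: sample a few frames per video and caption each frame independently."""
    pairs = []
    for frames in videos:
        step = max(1, len(frames) // frames_per_video)
        sampled = list(frames[::step])[:frames_per_video]
        captions = [caption_image(frame) for frame in sampled]
        pairs.append((sampled, captions))
    return pairs
```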

Read More

Learning to Ask Like a Physician

E. Lehman, V. Lialin, et al., Clinical NLP 2022

Paper link

Full list of authors: Eric Lehman, Vladislav Lialin, Katelyn Y. Legaspi, Anne Janelle R. Sy, Patricia Therese S. Pile, Nicole Rose I. Alberto, Richard Raymund R. Ragasa, Corinna Victoria M. Puyat, Isabelle Rose I. Alberto, Pia Gabrielle I. Alfonso, Marianne Taliño, Dana Moukheiber, Byron C. Wallace, Anna Rumshisky, Jenifer J. Liang, Preethi Raghavan, Leo Anthony Celi, Peter Szolovits

Existing question answering (QA) datasets derived from electronic health records (EHR) are artificially generated and consequently fail to capture realistic physician information needs. We present Discharge Summary Clinical Questions (DiSCQ), a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. We analyze this dataset to characterize the types of information sought by medical experts. We also train baseline models for trigger detection and question generation (QG), paired with unsupervised answer retrieval over EHRs. Our baseline model is able to generate high-quality questions in over 62% of cases when prompted with human-selected triggers. We release this dataset (and all code to reproduce baseline model results) to facilitate further research into realistic clinical QA and QG: this https URL.

Read More

Life after BERT: What do Other Muppets Understand about Language?

V. Lialin, K. Zhao, N. Shivagunde, A. Rumshisky, ACL 2022

Paper link

Existing pre-trained transformer analysis works usually focus only on one or two model families at a time, overlooking the variability of the architecture and pre-training objectives. In our work, we utilize the oLMpics benchmark and psycholinguistic probing datasets for a diverse set of 29 models including T5, BART, and ALBERT. Additionally, we adapt the oLMpics zero-shot setup for autoregressive models and evaluate GPT networks of different sizes. Our findings show that none of these models can resolve compositional questions in a zero-shot fashion, suggesting that this skill is not learnable using existing pre-training objectives. Furthermore, we find that global model decisions such as architecture, directionality, size of the dataset, and pre-training objective are not predictive of a model’s linguistic capabilities.
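A hedged sketch of one way to run such zero-shot multiple-choice probes with an autoregressive model, assuming a Hugging Face-style causal LM and tokenizer interface; the exact prompt construction and scoring used in the paper may differ:

```python
import torch

@torch.no_grad()
def zero_shot_choice(model, tokenizer, prompt, candidates):
    """Score each candidate answer by the model's log-probability of the
    candidate tokens given the prompt, and return the highest-scoring one."""
    scores = []
    for candidate in candidates:
        input_ids = tokenizer(prompt + " " + candidate, return_tensors="pt").input_ids
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
        logits = model(input_ids).logits
        # log-probability of each next token given the preceding context
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        target = input_ids[:, 1:]
        token_scores = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        # sum only over the candidate tokens, not the shared prompt
        scores.append(token_scores[:, prompt_len - 1:].sum().item())
    return candidates[int(torch.tensor(scores).argmax())]
```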

Read More

Update Frequently, Update Fast: Retraining Semantic Parsing Systems in a Fraction of Time

V. Lialin, R. Goel, A. Simanovsky, A. Rumshisky, R. Shah, 2020

Paper link

Currently used semantic parsing systems deployed in voice assistants can require weeks to train. Datasets for these models often receive small and frequent updates (data patches). Each patch requires training a new model. To reduce training time, one can fine-tune the previously trained model on each patch, but naive fine-tuning exhibits catastrophic forgetting: degradation of the model's performance on the data not represented in the data patch. In this work, we propose a simple method that alleviates catastrophic forgetting and show that it is possible to match the performance of a model trained from scratch in less than 10% of the time via fine-tuning. The key to achieving this is supersampling and EWC regularization. We demonstrate the effectiveness of our method on multiple splits of the Facebook TOP and SNIPS datasets.
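A minimal sketch of the EWC regularizer mentioned above; the Fisher estimation, supersampling ratio, and `ewc_lambda` weight are assumptions for illustration, not the paper's exact setup:

```python
import torch

def snapshot(model):
    """Copy of the parameters of the previously trained model."""
    return {name: p.detach().clone() for name, p in model.named_parameters()}

def ewc_penalty(model, old_params, fisher_diag):
    """Elastic Weight Consolidation penalty: keeps parameters close to their
    previous values, weighted by a diagonal Fisher information estimate
    computed on the old training data."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in old_params:
            penalty = penalty + (fisher_diag[name] * (param - old_params[name]) ** 2).sum()
    return penalty

# Total loss when fine-tuning on a data patch: task loss on the
# (super-sampled) patch plus the forgetting regularizer.
# loss = task_loss + ewc_lambda * ewc_penalty(model, old_params, fisher_diag)
```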

Read More

Text is an Image: Augmentation via Embedding Mixing

K. Zhao, V. Lialin, A. Rumshisky, MIT PRIMES, 2020

Paper link

Data augmentation techniques are essential for computer vision, yielding significant accuracy improvements at little engineering cost. However, data augmentation for text has always been tricky. Synonym replacement techniques require a good thesaurus and domain-specific rules for synonym selection from the synset, while backtranslation techniques are computationally expensive and require a good translation model for the language of interest. In this paper, we present simple text augmentation techniques at the embeddings level, inspired by mixing-based image augmentations. These techniques are language-agnostic and require little to no hyperparameter tuning. We evaluate the augmentation techniques on IMDB and GLUE tasks, and the results show that the augmentations significantly improve the score of the RoBERTa model.
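A hedged sketch of one mixing-based augmentation at the embedding level, in the spirit of image mixup; the exact mixing variants and hyperparameters in the paper may differ:

```python
import torch

def mix_embeddings(embeddings, labels, alpha=0.2):
    """Mixup-style augmentation at the embedding level: blend each example's
    token embeddings (and its soft/one-hot label) with those of another
    randomly chosen example from the batch.
    Shapes: embeddings (batch, seq_len, dim), labels (batch, num_classes)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(embeddings.size(0))
    mixed_embeddings = lam * embeddings + (1.0 - lam) * embeddings[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_embeddings, mixed_labels
```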

Read More

Injecting Hierarchy with U-Net Transformers

D. Donahue, V. Lialin, A. Rumshisky, 2019

Paper link

The Transformer architecture has become increasingly popular over the past two years, owing to its impressive performance on a number of natural language processing (NLP) tasks. However, all Transformer computations occur at the level of word representations, and therefore it may be argued that Transformer models do not explicitly attempt to learn the hierarchical structure that is widely assumed to be integral to language. In the present work, we introduce hierarchical processing into the Transformer model, taking inspiration from the U-Net architecture, popular in computer vision for its hierarchical view of natural images. We empirically demonstrate that the proposed architecture outperforms both the vanilla Transformer and some strong baselines in the domain of chit-chat dialogue.
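A hedged sketch of the U-Net-style idea of combining full-resolution and pooled (coarser) token representations through a skip connection; the layer composition below is an illustrative assumption, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UNetTransformerBlock(nn.Module):
    """Process tokens at full resolution, pool the sequence, process the
    coarser sequence, then upsample and merge via a skip connection."""
    def __init__(self, dim, nhead=8):
        super().__init__()
        self.encode = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.coarse = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.decode = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)

    def forward(self, x):                               # x: (batch, seq_len, dim)
        fine = self.encode(x)                           # full-resolution pass
        pooled = F.avg_pool1d(fine.transpose(1, 2), kernel_size=2).transpose(1, 2)
        coarse = self.coarse(pooled)                    # half-resolution pass
        upsampled = coarse.repeat_interleave(2, dim=1)  # back to (roughly) full length
        if upsampled.size(1) < fine.size(1):            # pad if seq_len was odd
            upsampled = F.pad(upsampled, (0, 0, 0, fine.size(1) - upsampled.size(1)))
        return self.decode(fine + upsampled)            # U-Net-style skip connection
```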

Read More

Named Entity Recognition in Noisy Domains

V. Malykh & V. Lyalin. ICAIAI 2018.

Paper link

The Named Entity Recognition (NER) task is an important component of conversational AI. A typical user of a conversational system has no time to check the spelling or grammar of his or her utterances, so user utterances contain typos and spelling errors, and noise robustness should therefore be considered a significant aspect of the NER task. In this work, we study the noise robustness of variants of state-of-the-art named entity recognition models in three languages: English, on the CoNLL'03 corpus; Russian, on the Persons-1000 corpus; and French, on the CAp'2017 corpus. We also demonstrate state-of-the-art results on CAp'2017.

Read More

What Did You Say? On Classification of Noisy Texts.

V. Malykh & V. Lyalin. Neuroinformatics 2018.

Paper (in Russian)

The classic task of text classification has been studied in many works, but current approaches are mostly devoted to improving classification quality on what we call clean corpora, i.e., corpora that do not contain typos. In this work, we present the results of testing modern classification models in the presence of noise for two languages, English and Russian.

Read More