EEE 591/592 Seminar:
Beyond Pretraining Loss: Evaluating Value of Pretraining Data for Large Language Models at Scale
Berivan Işık, PhD
Google
Date/Time: Thursday, February 27, 2025 – 19:00–20:00 TRT (Turkey Time)
Place: Zoom
This is an online seminar. To request event details, please send a message to the department.
Abstract: The performance of Large Language Models (LLMs) depends fundamentally on the quality and quantity of their pretraining data. Understanding the impact of data choices and scale is crucial for optimizing model development and deployment. This talk will begin by highlighting the importance of accurately estimating the effect of data characteristics on LLM performance. We will then delve into scaling laws, which provide valuable insights for predicting pretraining loss as a function of pretraining data size and model parameters. However, pretraining loss alone does not fully capture the complex interplay between pretraining data and downstream task performance. To address this, we will explore our recent work that extends beyond pretraining loss to directly predict downstream metric performance at scale. This approach provides a more comprehensive evaluation of the value of pretraining data, enabling a more nuanced understanding of how data influences real-world applications.
This talk will draw upon findings in our recent ICLR 2025 paper: Scaling Laws for Downstream Task Performance in Machine Translation.
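As background on the kind of relationships the talk will discuss, scaling laws are typically fit as simple parametric curves. A minimal sketch follows (illustrative functional forms only; the coefficients E, A, alpha, beta and the notation D_p for pretraining data size are assumptions here, not taken verbatim from the paper):

    L(D_p) = E + \frac{A}{D_p^{\alpha}}
    \qquad \text{(cross-entropy loss as a power law in pretraining data size)}

    f(D_p) = \bigl(\log(A \cdot D_p^{\alpha})\bigr)^{\beta}
    \qquad \text{(a log-law sometimes used for downstream metrics such as BLEU)}

The exact functional forms, fitting procedure, and their validation for machine translation are given in the paper referenced above.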
Biography: Berivan Işık is a research scientist at Google, working on efficient and trustworthy AI. Her current interests include efficient training of large models, data valuation and scaling laws for LLMs, and unlearning. She completed her PhD at Stanford University in 2024. Her research was supported by the Stanford Graduate Fellowship, the Google PhD Fellowship, and a Meta research grant.