
Data-efficient machine learning

Large datasets have enabled over-parameterized neural networks to achieve unprecedented success. However, training such models, with millions or billions of parameters, on large data requires expensive computational resources that consume substantial energy, leave a massive carbon footprint, and quickly become obsolete and turn into e-waste. While there has been a persistent effort to improve the performance and reliability of machine learning models, their sustainability is often neglected.

One way to make machine learning more sustainable and efficient is to select the most relevant data for training. We address this problem by proposing rigorous methods for finding coresets, i.e., small weighted subsets of the training data, for training machine learning models, in particular neural networks.
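As a rough illustration of the idea (a simplified sketch in the spirit of gradient-based coreset selection, not the exact algorithm from any of the papers below), one can greedily pick a small set of examples whose gradients "cover" the gradients of the full dataset via facility-location maximization, and weight each selected example by the number of examples it represents. The function name `select_coreset` and its inputs are hypothetical:

```python
import numpy as np

def select_coreset(grads, k):
    """Greedy facility-location selection over per-example gradients.

    grads: (n, d) array of per-example gradient estimates.
    k:     coreset size.
    Returns (indices, weights), where weights[i] counts the examples
    best represented by the i-th selected example.
    """
    n = grads.shape[0]
    # Pairwise distances between gradients, inverted into similarities.
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    sims = dists.max() - dists
    covered = np.zeros(n)   # best similarity achieved so far per example
    selected = []
    for _ in range(k):
        # Marginal gain of adding each candidate to the current set.
        gains = np.maximum(sims - covered, 0.0).sum(axis=1)
        gains[selected] = -1.0          # never re-select an example
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sims[best])
    # Weight each selected example by the size of its "cluster".
    assign = np.argmax(sims[selected], axis=0)
    weights = np.bincount(assign, minlength=k)
    return np.array(selected), weights

rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 8))        # toy stand-in for real gradients
idx, w = select_coreset(grads, 5)
```

Training then runs on the selected examples only, with each example's loss scaled by its weight so the weighted mini-batch gradient approximates the full-data gradient.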

Check out the following papers to learn more:

  1. Understanding the Role of Training Data in Test-Time Scaling
    Adel Javanmard, Baharan Mirzasoleiman, and Vahab Mirrokni
    International Conference on Learning Representations (ICLR), 2026
  2. Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models
    Dang Nguyen, Jiping Li, Jinghao Zheng, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2026
  3. Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures
    Dang Nguyen, Wenhan Yang, Rathul Anand, Yu Yang, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2025
  4. Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks
    Siddharth Joshi, Jiayi Ni, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2025
  5. Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity
    Siddharth Joshi, Arnav Jain, Ali Payani, and Baharan Mirzasoleiman
    International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
  6. Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality
    Xuxi Chen*, Yu Yang*, Zhangyang Wang, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2024
  7. Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least
    Siddharth Joshi, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2023
  8. Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
    Yu Yang, Hao Kang, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2023
  9. NeSSA: Near-Storage Data Selection for Accelerated Machine Learning Training
    Neha Prakriya, Yu Yang, Baharan Mirzasoleiman, Cho-Jui Hsieh, and Jason Cong
    ACM Workshop on Hot Topics in Storage and File Systems (HotStorage), 2023
  10. Data-Efficient Augmentation for Training Neural Networks
    Tian Yu Liu, and Baharan Mirzasoleiman
    Advances in Neural Information Processing Systems (NeurIPS), 2022
  11. Adaptive Second Order Coresets for Data-efficient Machine Learning
    Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2022
  12. Coresets for Data-efficient Training of Machine Learning Models
    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec
    International Conference on Machine Learning (ICML), 2020