Data-efficient machine learning

Large datasets have enabled over-parameterized neural networks to achieve unprecedented success. However, training such models, with millions or billions of parameters, on large data requires expensive computational resources, which consume substantial energy, leave a massive carbon footprint, and often become obsolete quickly and turn into e-waste. While there has been a persistent effort to improve the performance and reliability of machine learning models, their sustainability is often neglected.

One approach to making machine learning more sustainable and efficient is to select only the most relevant data for training. We address this problem by proposing rigorous methods to find coresets, small, weighted subsets of the training data that provably approximate training on the full dataset, for training machine learning models, in particular neural networks.
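To give a flavor of how a coreset can be chosen, the sketch below greedily selects a small weighted subset whose per-example gradients cover the gradients of the full dataset, using a facility-location objective. This is a simplified illustration in the spirit of gradient-based coreset selection (as in the ICML 2020 paper below), not the exact algorithm from any of these papers; the function name and the use of raw gradient vectors are assumptions for the example.

```python
import numpy as np

def select_coreset(grads, k):
    """Greedily pick k examples whose gradients best cover the full set.

    grads: (n, d) array of per-example gradient vectors (illustrative input).
    Returns the selected indices and a weight per selected example, where
    each weight counts how many training examples that medoid covers.
    """
    n = grads.shape[0]
    # Pairwise Euclidean distances between gradients, turned into
    # similarities (larger = more similar) for the facility-location objective.
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=2)
    sims = dists.max() - dists

    best = np.zeros(n)   # best similarity each example gets from the coreset so far
    selected = []
    for _ in range(k):
        # Total coverage if each candidate were added to the current coreset.
        gains = np.maximum(sims, best[:, None]).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never re-pick an element
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sims[:, j])

    # Weight each medoid by the number of examples it covers best, so the
    # weighted coreset gradient approximates the full-dataset gradient.
    assign = sims[:, selected].argmax(axis=1)
    weights = np.bincount(assign, minlength=k)
    return selected, weights
```

For example, on gradients drawn from two well-separated clusters, a budget of k=2 picks one representative per cluster, each weighted by its cluster size.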

Check out the following papers to learn more:

  1. Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity
    Siddharth Joshi, Arnav Jain, Ali Payani, and Baharan Mirzasoleiman
    International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
  2. Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality
    Xuxi Chen*, Yu Yang*, Zhangyang Wang, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2024
  3. Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least
    International Conference on Machine Learning (ICML), 2023
  4. Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
    Yu Yang, Hao Kang, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2023
  5. NeSSA: Near-Storage Data Selection for Accelerated Machine Learning Training
    Neha Prakriya, Yu Yang, Baharan Mirzasoleiman, Cho-Jui Hsieh, and Jason Cong
    ACM Workshop on Hot Topics in Storage and File Systems (HotStorage), 2023
  6. Data-Efficient Augmentation for Training Neural Networks
    Tian Yu Liu, and Baharan Mirzasoleiman
    Advances in Neural Information Processing Systems (NeurIPS), 2022
  7. Adaptive second order coresets for data-efficient machine learning
    Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2022
  8. Coresets for data-efficient training of machine learning models
    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec
    International Conference on Machine Learning (ICML), 2020