Data-efficient machine learning

Large datasets have enabled over-parameterized neural networks to achieve unprecedented success. However, training such models, with millions or billions of parameters, on large data requires expensive computational resources, which consume substantial energy, leave a massive carbon footprint, and often soon become obsolete and turn into e-waste. While there has been a persistent effort to improve the performance and reliability of machine learning models, their sustainability is often neglected.

One approach to making machine learning more sustainable and efficient is to select the most relevant data for training. We address this problem by proposing rigorous methods for finding coresets, i.e., small weighted subsets of the training data, for training machine learning models, in particular neural networks.
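To make the idea concrete, the sketch below selects a weighted coreset by greedily maximizing a facility-location objective over per-example gradient (or feature) vectors, so that each selected example "covers" a cluster of similar examples and is weighted by that cluster's size. This is an illustrative simplification, not the exact method from the papers below, which use more sophisticated gradient approximations and guarantees; the function name `greedy_coreset` and the similarity choice are assumptions for this example.

```python
import numpy as np

def greedy_coreset(grads, k):
    """Select k examples whose gradients best cover the full set.

    Illustrative sketch only: greedy facility-location maximization over
    a similarity matrix, with each selected example weighted by the number
    of examples it represents. Real coreset methods (e.g., those in the
    papers listed below) use last-layer gradient approximations and come
    with approximation guarantees.
    """
    n = grads.shape[0]
    # Pairwise similarity: negative Euclidean distance, shifted to be >= 0.
    dists = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=2)
    sim = dists.max() - dists
    selected = []
    cover = np.zeros(n)  # best similarity each example has to the coreset so far
    for _ in range(k):
        # Marginal gain of adding each candidate to the coreset.
        gains = np.maximum(sim, cover).sum(axis=1) - cover.sum()
        gains[selected] = -np.inf  # never re-pick an element
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[j])
    # Weight each selected example by how many examples it covers best.
    assign = np.argmax(sim[selected], axis=0)
    weights = np.bincount(assign, minlength=k)
    return selected, weights
```

During training, the loss on each selected example would be scaled by its weight, so that the weighted gradient over the coreset approximates the gradient over the full dataset.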

Check out the following papers to learn more:

  1. Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least
    International Conference on Machine Learning (ICML), 2023
  2. Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
    Yu Yang, Hao Kang, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2023
  3. NeSSA: Near-Storage Data Selection for Accelerated Machine Learning Training
    Neha Prakriya, Yu Yang, Baharan Mirzasoleiman, Cho-Jui Hsieh, and Jason Cong
    ACM Workshop on Hot Topics in Storage and File Systems (HotStorage), 2023
  4. Data-Efficient Augmentation for Training Neural Networks
    Tian Yu Liu and Baharan Mirzasoleiman
    Advances in Neural Information Processing Systems (NeurIPS), 2022
  5. Adaptive Second Order Coresets for Data-efficient Machine Learning
    Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2022
  6. Coresets for Data-efficient Training of Machine Learning Models
    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec
    International Conference on Machine Learning (ICML), 2020