Publications

For the complete list, please see my Google Scholar Profile.

Preprints

  1. ArXiv
    LoRA is All You Need for Safety Alignment of Reasoning LLMs
    Yihao Xue and Baharan Mirzasoleiman
    arXiv preprint arXiv:2507.17075
  2. ArXiv
    Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models
    Adel Javanmard, Baharan Mirzasoleiman, and Vahab Mirrokni
    arXiv preprint arXiv:2603.01293
  3. ArXiv
    Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from k-Parity
    Jianhao Huang and Baharan Mirzasoleiman
    arXiv preprint arXiv:2601.22450
  4. ArXiv
    Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories
    Nilay Naharas*, Dang Nguyen*, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, and Baharan Mirzasoleiman
    arXiv preprint arXiv:2510.01454
  5. ArXiv
    Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision
    Yihao Xue, Allan Zhang, Jianhao Huang, Amit Sahai, and Baharan Mirzasoleiman
    arXiv preprint arXiv:2602.00927
  6. ArXiv
    Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs
    Tushaar Gangavarapu*, Jiping Li*, Christopher Vattheuer*, Zhangyang Wang, and Baharan Mirzasoleiman
    arXiv preprint arXiv:2602.00576
  7. ArXiv
    Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
    Wenhan Yang and Baharan Mirzasoleiman
    arXiv preprint arXiv:2505.24208
  8. ArXiv
    Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection
    Yihao Xue, Kristjan Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman
    arXiv preprint arXiv:2502.15845
  9. ArXiv
    Challenges and Opportunities in Improving Worst-Group Generalization in Presence of Spurious Features
    Siddharth Joshi, Yu Yang, Yihao Xue, Wenhan Yang, and Baharan Mirzasoleiman
    arXiv preprint arXiv:2306.11957

2026

  1. ICLR
    Understanding the Role of Training Data in Test-Time Scaling
    Adel Javanmard, Baharan Mirzasoleiman, and Vahab Mirrokni
    International Conference on Learning Representations (ICLR), 2026
  2. ICLR
    Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models
    Dang Nguyen, Jiping Li, Jinghao Zheng, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2026

2025

  1. DMLR
    MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
    Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, and Baharan Mirzasoleiman
    Journal of Data-centric Machine Learning Research (DMLR), 2025
  2. Synthetic Text Generation for Training Large Language Models via Gradient Matching
    Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2025
  3. Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions
    Yihao Xue, Jiping Li, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2025
  4. ICLR
    Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures
    Dang Nguyen, Wenhan Yang, Rathul Anand, Yu Yang, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2025
  5. ICLR
    Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks
    Siddharth Joshi, Jiayi Ni, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2025

2024

  1. Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization
    Dang Nguyen, Paymon Haddad, Eric Gan, and Baharan Mirzasoleiman
    Advances in Neural Information Processing Systems (NeurIPS), 2024
  2. SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
    Yu Yang, Siddhartha Mishra, Jeffery N. Chiang, and Baharan Mirzasoleiman
    Advances in Neural Information Processing Systems (NeurIPS), 2024
  3. Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks
    Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2024
  4. Few-shot Adaption to Distribution Shifts By Mixing Source and Target Embeddings
    Yihao Xue, Ali Payani, Yu Yang, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2024
  5. NeWRF: A Deep Learning Framework for Wireless Radiation Field Reconstruction and Channel Prediction
    Haofan Lu, Christopher Vattheuer, Baharan Mirzasoleiman, and Omid Abari
    International Conference on Machine Learning (ICML), 2024
  6. UAI
    Graph Contrastive Learning under Heterophily via Graph Filters
    Wenhan Yang and Baharan Mirzasoleiman
    Conference on Uncertainty in Artificial Intelligence (UAI), 2024
  7. UAI
    Investigating the Impact of Model Width and Density on Generalization in Presence of Label Noise
    Yihao Xue, Kyle Whitecross, and Baharan Mirzasoleiman
    Conference on Uncertainty in Artificial Intelligence (UAI), 2024
    Spotlight presentation
  8. Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity
    Siddharth Joshi, Arnav Jain, Ali Payani, and Baharan Mirzasoleiman
    International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
  9. Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias
    Yu Yang, Eric Gan, Gintare Karolina Dziugaite, and Baharan Mirzasoleiman
    International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
  10. ICLR
    Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift
    Yihao Xue, Siddharth Joshi, Dang Nguyen, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2024
  11. ICLR
    Investigating the Benefits of Projection Head for Representation Learning
    Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2024
  12. ICLR
    Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality
    Xuxi Chen*, Yu Yang*, Zhangyang Wang, and Baharan Mirzasoleiman
    International Conference on Learning Representations (ICLR), 2024

2023

  1. Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks
    Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman
    Advances in Neural Information Processing Systems (NeurIPS), 2023
  2. Robust Learning with Progressive Data Expansion Against Spurious Correlation
    Yihe Deng*, Yu Yang*, Baharan Mirzasoleiman, and Quanquan Gu
    Advances in Neural Information Processing Systems (NeurIPS), 2023
  3. J. Affect. Disord.
    Sleep, Brain Systems, and Persistent Stress in Early Adolescents During COVID-19: Insights from the ABCD Study
    Orsolya Kiss, Zihan Qu, Eva M. Müller-Oehring, Fiona C. Baker, and Baharan Mirzasoleiman
    Journal of Affective Disorders, 2023
  4. Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
    Yu Yang, Besmira Nushi, Hamid Palangi, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2023
  5. Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least
    Siddharth Joshi and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2023
  6. Which Features are Learned by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression
    Yihao Xue, Siddharth Joshi, Eric Gan, Pin-Yu Chen, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2023
    Oral presentation (top 2%)
  7. Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
    Yu Yang, Hao Kang, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2023
  8. HotStorage
    NeSSA: Near-Storage Data Selection for Accelerated Machine Learning Training
    Neha Prakriya, Yu Yang, Baharan Mirzasoleiman, Cho-Jui Hsieh, and Jason Cong
    ACM Workshop on Hot Topics in Storage and File Systems (HotStorage), 2023
  9. High Probability Bounds for Stochastic Continuous Submodular Maximization
    Evan Becker, Jingdong Gao, Ted Zadouri, and Baharan Mirzasoleiman
    International Conference on Artificial Intelligence and Statistics (AISTATS), 2023
  10. ICDH
    A Self-supervised Framework for Improved Data-Driven Monitoring of Stress via Multi-modal Passive Sensing
    Shayan Fazeli, Lionel Levine, Mehrab Beikzadeh, Baharan Mirzasoleiman, Bita Zadeh, Tara Peris, and Majid Sarrafzadeh
    IEEE Conference on Digital Health (ICDH), 2023
  11. TKDE
    On the fairness of time-critical influence maximization in social networks
    Junaid Ali, Mahmoudreza Babaei, Abhijnan Chakraborty, Baharan Mirzasoleiman, Krishna Gummadi, and Adish Singla
    IEEE Transactions on Knowledge and Data Engineering (TKDE), 2023

2022

  1. Friendly Noise against Adversarial Noise: A Powerful Defense against Data Poisoning Attack
    Tian Yu Liu, Yu Yang, and Baharan Mirzasoleiman
    Advances in Neural Information Processing Systems (NeurIPS), 2022
  2. Data-Efficient Augmentation for Training Neural Networks
    Tian Yu Liu and Baharan Mirzasoleiman
    Advances in Neural Information Processing Systems (NeurIPS), 2022
  3. Not all poisons are created equal: Robust training against data poisoning
    Yu Yang, Tian Yu Liu, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2022
    Oral presentation (top 2%)
  4. Adaptive second order coresets for data-efficient machine learning
    Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2022
  5. Investigating why contrastive learning benefits robustness against label noise
    Yihao Xue, Kyle Whitecross, and Baharan Mirzasoleiman
    International Conference on Machine Learning (ICML), 2022
  6. Syn.Data4ML
    Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization
    Omead Pooladzandi, Pasha Khosravi, Erik Nijkamp, and Baharan Mirzasoleiman
    NeurIPS SyntheticData4ML Workshop, 2022
  7. BIBM
    Passive Monitoring of Physiological Precursors of Stress Leveraging Smartwatch Data
    Shayan Fazeli, Lionel Levine, Mehrab Beikzadeh, Baharan Mirzasoleiman, Bita Zadeh, Tara Peris, and Majid Sarrafzadeh
    IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2022
  8. EAAMO
    Towards Balanced Information Propagation in Social Media
    Mahmoudreza Babaei, Baharan Mirzasoleiman, Jungseock Joo, and Adrian Weller
    ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO), 2022
  9. CompBio
    Purification of single-cell transcriptomics data with coreset selection
    Róbert Pálovics, Tony Wyss-Coray, and Baharan Mirzasoleiman
    ICML Workshop on Computational Biology (CompBio), 2022
  10. TempWeb
    Analytical Models for Motifs in Temporal Networks
    Alexandra Porter, Baharan Mirzasoleiman, and Jure Leskovec
    Temporal Web Analytics Workshop (TempWeb), 2022
  11. SNN
    Low Rank Pruning via Output Perturbation
    Yuhan Liu, Siddharth Joshi, and Baharan Mirzasoleiman
    Sparsity in Neural Networks Workshop (SNN), 2022
  12. Crosswalk: Fairness-enhanced node representation learning
    Ahmad Khajehnejad, Moein Khajehnejad, Mahmoudreza Babaei, Krishna P Gummadi, Adrian Weller, and Baharan Mirzasoleiman
    AAAI Conference on Artificial Intelligence (AAAI), 2022
  13. ICDE
    On the fairness of time-critical influence maximization in social networks
    Junaid Ali, Mahmoudreza Babaei, Abhijnan Chakraborty, Baharan Mirzasoleiman, Krishna Gummadi, and Adish Singla
    IEEE International Conference on Data Engineering (ICDE), 2022

2020

  1. UAI
    Coresets for estimating means and mean square error with limited greedy samples
    Saeed Vahidian, Baharan Mirzasoleiman, and Alexander Cloninger
    Conference on Uncertainty in Artificial Intelligence (UAI), 2020
  2. Coresets for data-efficient training of machine learning models
    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec
    International Conference on Machine Learning (ICML), 2020
  3. Coresets for robust training of deep neural networks against noisy labels
    Baharan Mirzasoleiman, Kaidi Cao, and Jure Leskovec
    Advances in Neural Information Processing Systems (NeurIPS), 2020
  4. ICLR
    Selection via Proxy: Efficient Data Selection for Deep Learning
    Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia
    International Conference on Learning Representations (ICLR), 2020

2018

  1. Streaming non-monotone submodular maximization: Personalized video summarization on the fly
    Baharan Mirzasoleiman, Stefanie Jegelka, and Andreas Krause
    AAAI Conference on Artificial Intelligence (AAAI), 2018
  2. Dynamic network model from partial observations
    Elahe Ghalebi, Baharan Mirzasoleiman, Radu Grosu, and Jure Leskovec
    Advances in Neural Information Processing Systems (NeurIPS), 2018
    Spotlight presentation (top 3%)

2017

  1. Deletion-robust submodular maximization: Data summarization with “the right to be forgotten”
    Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause
    International Conference on Machine Learning (ICML), 2017
  2. Guaranteed non-convex optimization: Submodular maximization over continuous domains
    Andrew An Bian, Baharan Mirzasoleiman, Joachim Buhmann, and Andreas Krause
    International Conference on Artificial Intelligence and Statistics (AISTATS), 2017

2016

  1. Learning sparse combinatorial representations via two-stage submodular maximization
    Eric Balkanski*, Baharan Mirzasoleiman*, Andreas Krause, and Yaron Singer
    International Conference on Machine Learning (ICML), 2016
  2. Fast constrained submodular maximization: Personalized data summarization
    Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, and Amin Karbasi
    International Conference on Machine Learning (ICML), 2016
  3. Fast distributed submodular cover: Public-private data summarization
    Baharan Mirzasoleiman, Morteza Zadimoghaddam, and Amin Karbasi
    Advances in Neural Information Processing Systems (NeurIPS), 2016
  4. JMLR
    Distributed submodular maximization
    Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause
    The Journal of Machine Learning Research (JMLR), 2016

2015

  1. Lazier than lazy greedy
    Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause
    AAAI Conference on Artificial Intelligence (AAAI), 2015
  2. Distributed submodular cover: Succinctly summarizing massive data
    Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, and Andreas Krause
    Advances in Neural Information Processing Systems (NeurIPS), 2015
    Spotlight presentation (top 4%)

2014

  1. KDD
    Streaming submodular maximization: Massive data summarization on the fly
    Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause
    ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014
  2. NetSciCom
    Modeling the impact of user awareness on immunization strategies
    Baharan Mirzasoleiman, Hamid R Rabiee, and Mostafa Salehi
    IEEE International Workshop on Network Science for Communication Networks (NetSciCom), 2014

2013

  1. SNAM
    Revenue maximization in social networks through discounting
    Mahmoudreza Babaei, Baharan Mirzasoleiman, Mahdi Jalili, and Mohammad Ali Safari
    Social Network Analysis and Mining (SNAM), 2013
  2. Distributed submodular maximization: Identifying representative elements in massive data
    Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause
    Advances in Neural Information Processing Systems (NeurIPS), 2013

2012

  1. Europhys.Lett.
    Immunizing complex networks with limited budget
    Baharan Mirzasoleiman, Mahmoudreza Babaei, and Mahdi Jalili
    Europhysics Letters, 2012

2011

  1. Phys.Rev.E
    Cascaded failures in weighted networks
    Baharan Mirzasoleiman, Mahmoudreza Babaei, Mahdi Jalili, and Mohammad Ali Safari
    Physical Review E, 2011
  2. PLoS
    Failure tolerance of motif structure in biological networks
    Baharan Mirzasoleiman and Mahdi Jalili
    PLoS One, 2011
  3. ICC
    Reuse-Attack Mitigation in Wireless Sensor Networks
    Hossein Shafiei, Ahmad Khonsari, Baharan Mirzasoleiman, and Mohammad Ould-Khaoua
    IEEE International Conference on Communications (ICC), 2011
    Best paper award runner up

2009

  1. ISPA
    Utility proportional optimization flow control for overlay multicast
    Ali Jafari, Hossein Shafiei, Baharan Mirzasoleiman, and Ghodrat Sepidnam
    IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2009

Thesis

  1. Thesis
    Big data summarization using submodular functions
    Baharan Mirzasoleiman
    ETH Zurich, May 2017