Publications | Baharan Mirzasoleiman

Preprints

ArXiv
LoRA is All You Need for Safety Alignment of Reasoning LLMs

Yihao Xue and Baharan Mirzasoleiman

arXiv preprint arXiv:2507.17075, 2025

Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the "Safety Tax". In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs – with safety levels comparable to full-model fine-tuning – without compromising their reasoning abilities. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. We also explore methods that further reduce such overlap – via regularization or during weight merging – and observe some improvement on certain tasks. We hope this result motivates designing approaches that yield more consistent improvements in the reasoning-safety trade-off.
@article{xue25lora, title = {LoRA is All You Need for Safety Alignment of Reasoning LLMs}, author = {Xue, Yihao and Mirzasoleiman, Baharan}, journal = {arXiv preprint arXiv:2507.17075}, year = {2025} }
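The abstract of the entry above attributes its result to restricting weight updates to a low-rank space. The snippet below is a minimal, generic sketch of a LoRA-style update, not the authors' code. The shapes, the `alpha / r` scaling, and the random initialization of `B` are illustrative assumptions (in practice `B` typically starts at zero).

```python
import numpy as np

def lora_update(W, A, B, alpha=1.0):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2            # hypothetical layer shape and rank
W = rng.normal(size=(d_out, d_in))  # frozen pretrained weights
A = rng.normal(size=(r, d_in))      # trainable down-projection
B = rng.normal(size=(d_out, r))     # trainable up-projection (random here for illustration)
W_new = lora_update(W, A, B, alpha=2.0)

# The update B @ A can have rank at most r, however large W is.
print(np.linalg.matrix_rank(B @ A))
```

Only `A` and `B` are trained, so the change to `W` stays confined to a subspace of rank at most `r`.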
ArXiv
Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision

Yihao Xue, Allan Zhang, Jianhao Huang, Amit Sahai, and Baharan Mirzasoleiman

arXiv preprint arXiv:2602.00927, 2026

Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models that can solve complex problems previously out of reach. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. This makes reasoning length an important scaling knob. In this work, we identify a novel phenomenon (both theoretically and experimentally): under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length (e.g., the token budget in RL, or the loop count in looped Transformers) increases, even after in-distribution (ID) performance has saturated. This suggests that robustness may require a larger budget than ID validation alone would indicate. We provide theoretical explanations via two mechanisms: (i) self-iteration can induce a stronger inductive bias in the hypothesis class, reshaping ID-optimal solutions in ways that improve OOD generalization; and (ii) when shortcut solutions that work for ID samples but not for OOD samples persist in the hypothesis class, regularization can reduce the learned solution’s reliance on these shortcuts as the number of self-iterations increases. We complement the theory with empirical evidence from two realizations of scaling training-time reasoning length: increasing the number of loops in looped Transformers on a synthetic task, and increasing token budgets during RL fine-tuning of LLMs on mathematical reasoning.
@article{xue26rl, title = {Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision}, author = {Xue, Yihao and Zhang, Allan and Huang, Jianhao and Sahai, Amit and Mirzasoleiman, Baharan}, journal = {arXiv preprint arXiv:2602.00927}, year = {2026} }
ArXiv
Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs

Tushaar Gangavarapu*, Jiping Li*, Christopher Vattheuer*, Zhangyang Wang, and Baharan Mirzasoleiman

arXiv preprint arXiv:2602.00576, 2026

Can modifying the training data distribution guide optimizers toward solutions with improved generalization when training large language models (LLMs)? In this work, we theoretically analyze an in-context linear regression model with multi-head linear self-attention, and compare the training dynamics of two gradient-based optimizers, namely gradient descent (GD) and sharpness-aware minimization (SAM); the latter exhibits superior generalization properties but is prohibitively expensive for training even medium-sized LLMs. We show, for the first time, that SAM induces a lower simplicity bias (SB), the tendency of an optimizer to preferentially learn simpler features earlier in training, and identify this reduction as a key factor underlying its improved generalization performance. Motivated by this insight, we demonstrate that altering the training data distribution by upsampling or augmenting examples learned later in training similarly reduces SB and leads to improved generalization. Our extensive experiments show that our strategy improves the performance of multiple LLMs, including Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base, achieving relative accuracy gains up to 18% when fine-tuned with AdamW and Muon on mathematical reasoning tasks.
@article{gangavarapu26synthetic, title = {Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs}, author = {Gangavarapu, Tushaar and Li, Jiping and Vattheuer, Christopher and Wang, Zhangyang and Mirzasoleiman, Baharan}, journal = {arXiv preprint arXiv:2602.00576}, year = {2026} }
ArXiv
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap

Wenhan Yang and Baharan Mirzasoleiman

arXiv preprint arXiv:2505.24208, 2025

Ensuring Vision-Language Models (VLMs) generate safe outputs is crucial for their reliable deployment. However, LVLMs suffer from drastic safety degradation compared to their LLM backbone. Even blank or irrelevant images can trigger LVLMs to generate harmful responses to prompts that would otherwise be refused in text-only contexts. The modality gap between image and text representations has been recently hypothesized to contribute to safety degradation of LVLMs. However, whether and how the size of the modality gap affects LVLMs’ safety has not been studied. In this work, we show that the size of the modality gap is highly inversely correlated with VLMs’ safety. Then, we show that this modality gap is introduced during the pretraining of LVLMs and persists through fine-tuning. Inspired by this observation, we propose a regularization to reduce the modality gap during pretraining. Our extensive experiments on LLaVA v1.5, ShareGPT4V, and MiniGPT-4 show that our method substantially improves safety alignment of LVLMs, reducing unsafe rate by up to 16.3% without compromising performance, and can further boost existing defenses by up to 18.2%.
@article{yang25safety, title = {Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap}, author = {Yang, Wenhan and Mirzasoleiman, Baharan}, journal = {arXiv preprint arXiv:2505.24208}, year = {2025} }
ArXiv
Challenges and Opportunities in Improving Worst-Group Generalization in Presence of Spurious Features

Siddharth Joshi, Yu Yang, Yihao Xue, Wenhan Yang, and Baharan Mirzasoleiman

arXiv preprint arXiv:2306.11957, 2023

Deep neural networks often exploit (spurious) features that are present in the majority of examples within a class during training. This leads to poor worst-group test accuracy i.e. poor accuracy for minority groups that lack these spurious features. Despite the growing body of recent efforts to address spurious correlations (SC), several challenging settings remain unexplored. In this work, we propose studying methods to mitigate SC in settings with 1) spurious features that are learned more slowly, 2) a larger number of classes and 3) a larger number of groups. We introduce two new datasets, SPUCOANIMALS and SPUCOSUN, to facilitate this study and conduct a systematic benchmarking of 8 state-of-the-art (SOTA) methods across a total of 5 vision datasets, training over 5K models. Through this, we highlight how existing group inference methods struggle in the presence of spurious features that are learned later in training. Additionally, we demonstrate how all existing methods struggle in settings with more groups and/or classes. Finally, we show the importance of careful model selection (hyperparameter tuning) in extracting optimal performance, especially in the more challenging settings we introduced, and propose more cost-efficient strategies for model selection. Overall, through extensive and systematic experiments, this work uncovers a suite of new challenges and opportunities for improving worst-group generalization in the presence of spurious features.
@article{joshi2023spuco, title = {Challenges and Opportunities in Improving Worst-Group Generalization in Presence of Spurious Features}, author = {Joshi, Siddharth and Yang, Yu and Xue, Yihao and Yang, Wenhan and Mirzasoleiman, Baharan}, journal = {arXiv preprint arXiv:2306.11957}, year = {2023}, spurious = {true} }

2026

COLM
Learning as Reasoning Unfolds: Progressive Rollout Allocation for Efficient Reinforcement Learning

Heyang Jiang, Henry Liu, and Baharan Mirzasoleiman

Conference on Language Modeling (COLM), 2026

Reinforcement learning with verifiable rewards (RLVR) has emerged as a highly effective framework for improving LLM reasoning, with methods such as GRPO among its most successful instantiations. However, GRPO relies on repeated generation of long chain-of-thought rollouts. Training time scales with the number of rollouts, a large fraction of which are uninformative. Thus, GRPO is computationally expensive and unstable. To mitigate this, existing approaches either generate a larger pool of rollouts and filter the most informative prompts, or leverage historical signals for filtering at later stages of training. These strategies offer modest performance gains, but slow down the overall process. To address this, we propose VarIance Guided Online Rollout allocation (VIGOR), which, instead of allocating a fixed rollout budget per example, begins with a small number of rollouts for all examples in a batch and iteratively allocates additional rollouts to those with the highest group reward variance until a fixed total rollout budget is reached. Theoretically, we show that under the binary-reward setting of GRPO, within-group reward variance directly controls the gradient magnitude and speeds up training. Experiments on three model scales show that VIGOR reaches the target accuracy with up to 3.6× fewer rollouts, outperforms GRPO and baselines by up to 2.4%, and further improves training stability and final performance under extended training.
@article{huang26multi, title = {Learning as Reasoning Unfolds: Progressive Rollout Allocation for Efficient Reinforcement Learning}, author = {Jiang, Heyang and Liu, Henry and Mirzasoleiman, Baharan}, journal = {Conference on Language Modeling (COLM)}, year = {2026} }
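The allocation loop described in the abstract of the entry above can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation. The function name, the simulated Bernoulli rewards, the tie-breaking rule, and the one-rollout step size are all assumptions.

```python
import numpy as np

def allocate_rollouts(success_prob, total_budget, init_rollouts=2, seed=0):
    """Variance-guided rollout allocation with simulated binary rewards.

    success_prob: hypothetical per-prompt probability of a correct rollout.
    Returns the number of rollouts assigned to each prompt.
    """
    n = len(success_prob)
    if total_budget < n * init_rollouts:
        raise ValueError("budget too small for the initial rollouts")
    rng = np.random.default_rng(seed)
    # Start with a small number of rollouts for every prompt in the batch.
    rewards = [list(rng.binomial(1, p, size=init_rollouts)) for p in success_prob]
    used = n * init_rollouts
    while used < total_budget:
        # Pick the prompt with the highest within-group reward variance;
        # ties go to the prompt with the fewest rollouts so far (an assumption).
        i = max(range(n), key=lambda j: (np.var(rewards[j]), -len(rewards[j])))
        rewards[i].append(rng.binomial(1, success_prob[i]))
        used += 1
    return [len(r) for r in rewards]

counts = allocate_rollouts([0.0, 0.5, 1.0, 0.5], total_budget=20)
print(counts)
```

Prompts whose rollouts all receive the same reward have zero variance, so in this sketch they only receive extra rollouts when every group is tied at zero.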
COLM
How Transformers Learn to Plan via Multi-Token Prediction

Jianhao Huang, Zhanpeng Zhou, Renqiu Xia, Baharan Mirzasoleiman, Weijie Su, and Wei Huang

Conference on Language Modeling (COLM), 2026

While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.
@article{huang26multj, title = {How Transformers Learn to Plan via Multi-Token Prediction}, author = {Huang, Jianhao and Zhou, Zhanpeng and Xia, Renqiu and Mirzasoleiman, Baharan and Su, Weijie and Huang, Wei}, journal = {Conference on Language Modeling (COLM)}, year = {2026} }
TMLR
Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection

Yihao Xue, Kristjan Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman

Transactions on Machine Learning Research (TMLR), 2026

Large Language Models (LLMs) often hallucinate, limiting their reliability in sensitive applications. In black-box settings, several self-consistency-based techniques have been proposed for hallucination detection. We empirically show that these methods perform nearly as well as a supervised (black-box) oracle, leaving limited room for further gains within this paradigm. To address this limitation, we explore cross-model consistency checking between the target model and an additional verifier LLM. With this extra information, we observe improved oracle performance compared to purely self-consistency-based methods. We then propose a budget-friendly, two-stage detection algorithm that calls the verifier model only for a subset of cases. It dynamically switches between self-consistency and cross-consistency based on an uncertainty interval of the self-consistency classifier. We provide a geometric interpretation of consistency-based hallucination detection methods through the lens of kernel mean embeddings, offering deeper theoretical insights. Extensive experiments on QA-style hallucination detection benchmarks show that this approach maintains high detection performance while significantly reducing computational cost.
@article{xue26verify, title = {Verify when Uncertain: Beyond Self-Consistency in Black Box Hallucination Detection}, author = {Xue, Yihao and Greenewald, Kristjan and Mroueh, Youssef and Mirzasoleiman, Baharan}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2026} }
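The two-stage procedure in the abstract of the entry above (use self-consistency alone unless it falls in an uncertainty interval, then consult a verifier) could look roughly like the sketch below. The score convention, the interval bounds `low` and `high`, the `threshold`, and the `verifier_fn` callback are hypothetical placeholders rather than values from the paper.

```python
def detect_hallucination(self_score, verifier_fn, low=0.3, high=0.7, threshold=0.5):
    """Return (is_hallucination, verifier_called).

    self_score: self-consistency score in [0, 1]; higher means more consistent.
    verifier_fn: zero-argument callable returning a cross-model consistency
                 score in [0, 1]; only invoked when self_score is uncertain.
    """
    if self_score < low:
        return True, False    # confidently inconsistent: flag without the verifier
    if self_score > high:
        return False, False   # confidently consistent: accept without the verifier
    cross_score = verifier_fn()  # uncertain region: pay for the verifier call
    return cross_score < threshold, True

print(detect_hallucination(0.5, lambda: 0.2))
```

Passing the verifier as a callable keeps the expensive call lazy, so it is made only for the uncertain subset of cases.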
ICML
Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

Adel Javanmard, Baharan Mirzasoleiman, and Vahab Mirrokni

International Conference on Machine Learning (ICML), 2026

Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: (i) balanced pretraining data can induce latent capabilities later activated during post-training, and (ii) SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.
@article{javanmard26prepost, title = {Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models}, author = {Javanmard, Adel and Mirzasoleiman, Baharan and Mirrokni, Vahab}, journal = {International Conference on Machine Learning (ICML)}, year = {2026} }
ICML
Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from k-Parity

Jianhao Huang and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2026

Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their auto-regressive counterparts. In this work, we investigate these properties within the setting of the k-parity problem (computing the XOR sum of k relevant bits), where neural networks typically exhibit grokking, a prolonged plateau of chance-level performance followed by sudden generalization. We theoretically decompose the Masked Diffusion (MD) objective into a Signal regime, which drives feature learning, and a Noise regime, which serves as an implicit regularizer. By training nanoGPT using the MD objective on the k-parity problem, we demonstrate that the MD objective fundamentally alters the learning landscape, enabling rapid and simultaneous generalization without experiencing grokking. Furthermore, we leverage our theoretical insights to optimize the distribution of the mask probability in the MD objective. Our method significantly improves perplexity for 50M-parameter models and achieves superior results across both pre-training from scratch and supervised fine-tuning. Specifically, we observe performance gains peaking at 8.8% and 5.8%, respectively, on 8B-parameter models, confirming the scalability and effectiveness of our framework in large-scale masked diffusion language model regimes.
@article{huang26tuning, title = {Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity}, author = {Huang, Jianhao and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2026} }
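For readers unfamiliar with the k-parity problem named in the entry above, the label is simply the XOR of k designated bits of the input. The helper below is a small illustration. The function name and the choice of relevant indices are assumptions made for the example.

```python
import numpy as np

def k_parity(x, relevant_idx):
    """XOR of the bits of x at positions relevant_idx.

    x: integer array of 0/1 values with shape (n_samples, n_bits).
    relevant_idx: list of k column indices that determine the label.
    """
    return np.bitwise_xor.reduce(x[..., relevant_idx], axis=-1)

x = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0]])
labels = k_parity(x, [0, 1, 2])  # k = 3 relevant bits; the last bit is irrelevant
print(labels)
```

The remaining bits act as distractors, which is what makes the relevant subset hard to discover from samples.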
ICML
Reasoning Quality Emerges Early: Data Curation for Reasoning Models

Henry Jin, Wenhan Yang, Meysam Ghaffari, Carlos Morato, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2026

Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective approach for eliciting strong reasoning capabilities in Large Language Models (LLMs). However, existing methods for curating high-quality SFT data rely heavily on strong reasoning models to filter examples based on diversity and difficulty, making the curation process costly while often yielding suboptimal data quality. In this work, we show that diverse and challenging reasoning examples can be identified using only the initial reasoning tokens. Specifically, we demonstrate that difficult problems can be reliably detected based on the loss of the first 100 reasoning tokens evaluated at a randomly perturbed checkpoint of the pretrained model. We further show that examples exhibiting similar loss patterns over their first 1k reasoning tokens across a small number of perturbed checkpoints extrapolating along the fine-tuning trajectory provably induce similar gradients. We validate our approach through extensive experiments on fine-tuning Qwen2.5-7B and Llama3.1-8B models on the M23K medical reasoning and OpenThoughts-Math datasets. Our method outperforms existing baselines by up to 1.7% while being 91% more token efficient. Our repo is available at https://bigml-cs-ucla.github.io/TEMP-project-page/
@article{jin26reasoning, title = {Reasoning Quality Emerges Early: Data Curation for Reasoning Models}, author = {Jin, Henry and Yang, Wenhan and Ghaffari, Meysam and Morato, Carlos and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2026} }
ICML
Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories

Nilay Naharas*, Dang Nguyen*, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2026

Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection at different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments across 4 target models, 2 proxy models, and 2 datasets show that XMAS consistently outperforms 10 baseline methods. Moreover, XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2×. This is 30% more data reduction compared to the best baseline for LLaVA-665k. The project’s website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.
@article{naharas26data, title = {Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories}, author = {Naharas, Nilay and Nguyen, Dang and Bulut, Nesihan and Bateni, Mohammadhossein and Mirrokni, Vahab and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2026} }
ICLR
Understanding the Role of Training Data in Test-Time Scaling

Adel Javanmard, Baharan Mirzasoleiman, and Vahab Mirrokni

International Conference on Learning Representations (ICLR), 2026

Test-time scaling improves the reasoning capabilities of large language models (LLMs) by allocating extra compute to generate longer Chains-of-Thought (CoTs). This enables models to tackle more complex problems by breaking them down into additional steps, backtracking, and correcting mistakes. Despite its strong performance, demonstrated by OpenAI’s o1 and DeepSeek R1, the conditions in the training data under which long CoTs emerge, and when such long CoTs improve the performance, remain unclear. In this paper, we study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. Our analysis provides a theoretical explanation for several intriguing observations: First, at any fixed test error, increasing test-time compute allows us to reduce the number of in-context examples (context length) in training prompts. Second, if the skills required to solve a downstream task are not sufficiently present in the training data, increasing test-time compute can harm performance. Finally, we characterize task hardness via the smallest eigenvalue of its feature covariance matrix and show that training on a diverse, relevant, and hard set of tasks results in best performance for test-time scaling. We confirm our findings with experiments on large, nonlinear transformer architectures.
@article{javanmard2026data, title = {Understanding the Role of Training Data in Test-Time Scaling}, author = {Javanmard, Adel and Mirzasoleiman, Baharan and Mirrokni, Vahab}, journal = {International Conference on Learning Representations (ICLR)}, year = {2026} }
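The abstract of the entry above characterizes task hardness through the smallest eigenvalue of a task's feature covariance matrix. The snippet below only shows how that quantity can be computed. Using the uncentered second-moment matrix is an assumption, and the sketch makes no claim about how the paper maps the eigenvalue to a hardness score.

```python
import numpy as np

def smallest_cov_eigenvalue(X):
    """Smallest eigenvalue of the (uncentered) feature covariance X^T X / n.

    X: array of shape (n_samples, n_features).
    """
    cov = X.T @ X / X.shape[0]
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(cov)[0])

X = np.diag([2.0, 1.0])  # toy features: two samples, two dimensions
lam_min = smallest_cov_eigenvalue(X)
print(lam_min)
```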
ICLR
Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models

Dang Nguyen, Jiping Li, Jinghao Zheng, and Baharan Mirzasoleiman

International Conference on Learning Representations (ICLR), 2026

Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting the part of the data that is not learned early in training with faithful images (containing the same features but different noise) outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts generalization by up to 2.8% in a variety of scenarios, including training ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100, and TinyImageNet, with various optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet.
@article{nguyen2026synthetic, title = {Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models}, author = {Nguyen, Dang and Li, Jiping and Zheng, Jinghao and Mirzasoleiman, Baharan}, journal = {International Conference on Learning Representations (ICLR)}, year = {2026}, efficient = {true} }

2025

TMLR
Occam’s Razor for SSL: Memory-Efficient Parametric Instance Discrimination

Eric Gan, Patrik Reizinger, Alice Bizeul, Attila Juhos, Mark Ibrahim, Randall Balestriero, David Klindt, Wieland Brendel, and Baharan Mirzasoleiman

Transactions on Machine Learning Research (TMLR), 2025

Self-supervised learning (SSL) is the prevalent paradigm for representation learning, often relying on pairwise similarity between multiple augmented views of each example. Numerous learning methods with various complexities, such as gradient stopping, negative sampling, projectors, and additional regularization terms, were introduced in the past years. These methods can be effective, but they require careful hyperparameter tuning, have increased computational and memory requirements, and struggle with latent dimensionality collapse. Furthermore, complexities such as gradient stopping make them hard to analyse theoretically and confound the essential components of SSL. We introduce a simple parametric instance discrimination method, called Datum IndEx as its Target (DIET). DIET has a single computational branch, without explicit negative sampling, gradient stopping or other hyperparameters. We empirically demonstrate that DIET (1) can be implemented in a memory-efficient way; (2) achieves competitive performance with state-of-the-art SSL methods on small-scale datasets; and (3) is robust to hyperparameters such as batch size. We uncover tight connections to Spectral Contrastive Learning in the lazy training regime, leading to practical insights about the role of feature normalization. Compared to SimCLR or VICReg, DIET also has higher-rank embeddings on CIFAR100 and TinyImageNet, suggesting that DIET captures more latent information.
@article{gan2025occam, title = {Occam’s Razor for SSL: Memory-Efficient Parametric Instance Discrimination}, author = {Gan, Eric and Reizinger, Patrik and Bizeul, Alice and Juhos, Attila and Ibrahim, Mark and Balestriero, Randall and Klindt, David and Brendel, Wieland and Mirzasoleiman, Baharan}, journal = {Transactions on Machine Learning Research (TMLR)}, year = {2025} }
DMLR
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation

Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, and Baharan Mirzasoleiman

Journal of Data-centric Machine Learning Research (DMLR), 2025

Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at https://github.com/sjoshi804/MM-Gen.
@article{joshi25mmgen, title = {MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation}, author = {Joshi, Siddharth and Nushi, Besmira and Balachandran, Vidhisha and Chandrasekaran, Varun and Vineet, Vibhav and Joshi, Neel and Mirzasoleiman, Baharan}, journal = {Journal of Data-centric Machine Learning Research (DMLR)}, year = {2025} }
ICML
Synthetic Text Generation for Training Large Language Models via Gradient Matching

Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2025

Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristic and cannot generate human-readable text without compromising the privacy of real data or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that guarantees the convergence and performance of LLMs during fine-tuning on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text can guarantee convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data. Experiments on various classification tasks confirm the effectiveness of our proposed approach.
@article{nguyen25synthetic, title = {Synthetic Text Generation for Training Large Language Models via Gradient Matching}, author = {Nguyen, Dang and Li, Zeman and Bateni, Mohammadhossein and Mirrokni, Vahab and Razaviyayn, Meisam and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2025} }
ICML
Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions

Yihao Xue, Jiping Li, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2025

Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results revealed that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models’ internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.
@article{xue25w2s, title = {Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions}, author = {Xue, Yihao and Li, Jiping and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2025} }
ICLR
Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures

Dang Nguyen, Wenhan Yang, Rathul Anand, Yu Yang, and Baharan Mirzasoleiman

International Conference on Learning Representations (ICLR), 2025

Abs Bib PDF

Training with larger mini-batches improves the convergence rate and can yield superior performance. However, training with large mini-batches becomes prohibitive for Large Language Models (LLMs), due to the large GPU memory requirement. To address this problem, an effective approach is finding small mini-batch coresets that closely match the gradient of larger mini-batches. However, this approach becomes infeasible and ineffective for LLMs, due to the highly imbalanced mixture of sources in language data, use of the Adam optimizer, and the very large gradient dimensionality of LLMs. In this work, we address the above challenges by proposing Coresets for Training LLMs (CoLM). First, we show that mini-batch coresets found by gradient matching do not contain representative examples of the small sources w.h.p., and thus including all examples of the small sources in the mini-batch coresets is crucial for optimal performance. Second, we normalize the gradients by their historical exponential to find mini-batch coresets for training with Adam. Finally, we leverage zeroth-order methods to find a smooth gradient of the last V-projection matrix and sparsify it to keep the dimensions with the largest normalized gradient magnitude. We apply CoLM to fine-tuning Phi-2, Phi-3, Zephyr, and Llama-3 models with LoRA on the MathInstruct and SuperGLUE benchmarks. Remarkably, CoLM reduces the memory requirement of fine-tuning by 2x and even outperforms training with 4x larger mini-batches. Moreover, CoLM seamlessly integrates with existing memory-efficient training methods like LoRA, further reducing the memory requirements of training LLMs.
@article{nguyen2025memory, title = {Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures}, author = {Nguyen, Dang and Yang, Wenhan and Anand, Rathul and Yang, Yu and Mirzasoleiman, Baharan}, journal = {International Conference on Learning Representations (ICLR)}, year = {2025}, efficient = {true} }
ICLR
Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks

Siddharth Joshi, Jiayi Ni, and Baharan Mirzasoleiman

International Conference on Learning Representations (ICLR), 2025

Abs Bib PDF Code

Dataset distillation (DD) generates small synthetic datasets that can efficiently train deep networks with a limited amount of memory and compute. Despite the success of DD methods for supervised learning, DD for self-supervised pre-training of deep models has remained unaddressed. Pre-training on unlabeled data is crucial for efficiently generalizing to downstream tasks with limited labeled data. In this work, we propose the first effective DD method for SSL pre-training. First, we show, theoretically and empirically, that naïve application of supervised DD methods to SSL fails, due to the high variance of the SSL gradient. Then, we address this issue by relying on insights from knowledge distillation (KD) literature. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. Then, we generate a small synthetic dataset by matching the training trajectories of the student models. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders. Through extensive experiments, we show that our distilled sets lead to up to 13% higher accuracy than prior work, on a variety of downstream tasks, in the presence of limited labeled data. Code at https://github.com/BigML-CS-UCLA/MKDT
@article{joshi2025distillation, title = {Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks}, author = {Joshi, Siddharth and Ni, Jiayi and Mirzasoleiman, Baharan}, journal = {International Conference on Learning Representations (ICLR)}, year = {2025}, efficient = {true} }

2024

NeurIPS
Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization

Dang Nguyen, Paymon Haddad, Eric Gan, and Baharan Mirzasoleiman

Advances in Neural Information Processing Systems (NeurIPS), 2024

Abs Bib PDF

Can we modify the training data distribution to encourage the underlying optimization method toward finding solutions with superior generalization performance on in-distribution data? In this work, we approach this question for the first time by comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM). By studying a two-layer CNN, we prove that SAM learns easy and difficult features more uniformly, particularly in early epochs. That is, SAM is less susceptible to simplicity bias compared to GD. Based on this observation, we propose USEFUL, an algorithm that clusters examples based on the network output early in training and upsamples examples with no easy features to alleviate the pitfalls of the simplicity bias. We show empirically that modifying the training data distribution in this way effectively improves the generalization performance on the original data distribution when training with (S)GD by mimicking the training dynamics of SAM. Notably, we demonstrate that our method can be combined with SAM and existing data augmentation strategies to achieve, to the best of our knowledge, state-of-the-art performance for training ResNet18 on CIFAR10, STL10, CINIC10, Tiny-ImageNet; ResNet34 on CIFAR100; and VGG19 and DenseNet121 on CIFAR10.
@article{nguyen2024data, title = {Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization}, author = {Nguyen, Dang and Haddad, Paymon and Gan, Eric and Mirzasoleiman, Baharan}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2024} }
NeurIPS
SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Yu Yang, Siddhartha Mishra, Jeffery N. Chiang, and Baharan Mirzasoleiman

Advances in Neural Information Processing Systems (NeurIPS), 2024

Abs Bib PDF Code

Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.
@article{yang2024s2l, title = {SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models}, author = {Yang, Yu and Mishra, Siddhartha and Chiang, Jeffery N. and Mirzasoleiman, Baharan}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2024} }
ICML
Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks

Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2024

Abs Bib PDF Code

Contrastive Language-Image Pre-training (CLIP) on large image-caption datasets has achieved remarkable success in zero-shot classification and enabled transferability to new domains. However, CLIP is far more vulnerable to targeted data poisoning and backdoor attacks than supervised learning. Perhaps surprisingly, poisoning 0.0001% of CLIP pre-training data is enough to make targeted data poisoning attacks successful. This is four orders of magnitude smaller than what is required to poison supervised models. Despite this vulnerability, existing methods are very limited in defending CLIP models during pre-training. In this work, we propose a strong defense, SAFECLIP, to safely pre-train CLIP against targeted data poisoning and backdoor attacks. SAFECLIP warms up the model by applying unimodal contrastive learning (CL) on image and text modalities separately. Then, it divides the data into safe and risky sets, by applying a Gaussian Mixture Model to the cosine similarity of image-caption pair representations. SAFECLIP pre-trains the model by applying the CLIP loss to the safe set and applying unimodal CL to image and text modalities of the risky set separately. By gradually increasing the size of the safe set during pre-training, SAFECLIP effectively breaks targeted data poisoning and backdoor attacks without harming the CLIP performance. Our extensive experiments on CC3M, Visual Genome and MSCOCO demonstrate that SAFECLIP significantly reduces the success rate of targeted data poisoning attacks from 93.75% to 0% and that of various backdoor attacks from up to 100% to 0%, without harming CLIP’s performance.
@article{yang2023safe, title = {Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks}, author = {Yang, Wenhan and Gao, Jingdong and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2024}, poison = {true} }
ICML
Few-shot Adaption to Distribution Shifts By Mixing Source and Target Embeddings

Yihao Xue, Ali Payani, Yu Yang, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2024

Abs Bib PDF

Pretrained machine learning models need to be adapted to distribution shifts when deployed in new target environments. When obtaining labeled data from the target distribution is expensive, few-shot adaptation with only a few examples from the target distribution becomes essential. In this work, we propose MixPro, a lightweight and highly data-efficient approach for few-shot adaptation. MixPro first generates a relatively large dataset by mixing (linearly combining) pre-trained embeddings of large source data with those of the few target examples. This process preserves important features of both source and target distributions, while mitigating the specific noise in the small target data. Then, it trains a linear classifier on the mixed embeddings to effectively adapt the model to the target distribution without overfitting the small target data. Theoretically, we demonstrate the advantages of MixPro over previous methods. Our experiments, conducted across various model architectures on 8 datasets featuring different types of distribution shifts, reveal that MixPro can outperform baselines by as much as 7%, with only 2-4 target examples.
@article{xue2024fewshot, title = {Few-shot Adaption to Distribution Shifts By Mixing Source and Target Embeddings}, author = {Xue, Yihao and Payani, Ali and Yang, Yu and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2024}, spurious = {true} }
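The mixing step described in the MixPro abstract above can be illustrated with a minimal sketch. It assumes the mix is a convex combination with a fixed weight `s`, and that each source embedding is paired with a randomly chosen target embedding of the same class. The function name `mix_embeddings`, the weight `s`, and the pairing rule are illustrative assumptions, not details taken from the paper.

```python
import random

def mix_embeddings(source, target, s=0.5, seed=0):
    """Mix each source embedding with a same-class target embedding.

    source, target: lists of (embedding, label), embeddings as lists of floats.
    s: hypothetical weight on the source embedding (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # Index the few target embeddings by class label.
    by_label = {}
    for emb, label in target:
        by_label.setdefault(label, []).append(emb)

    mixed = []
    for emb, label in source:
        if label not in by_label:
            continue  # no target example of this class to mix with
        t = rng.choice(by_label[label])
        # Convex combination of the source and target embeddings.
        mixed.append(([s * a + (1 - s) * b for a, b in zip(emb, t)], label))
    return mixed

source = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([2.0, 0.0], 0)]
target = [([0.0, 0.0], 0), ([0.0, 3.0], 1)]
mixed = mix_embeddings(source, target, s=0.5)
```

A linear classifier would then be trained on `mixed` rather than on the handful of target embeddings alone.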
ICML
NeWRF: A Deep Learning Framework for Wireless Radiation Field Reconstruction and Channel Prediction

Haofan Lu, Christopher Vattheuer, Baharan Mirzasoleiman, and Omid Abari

International Conference on Machine Learning (ICML), 2024

Abs Bib PDF Code

We present NeWRF, a deep learning framework for predicting wireless channels. Wireless channel prediction is a long-standing problem in the wireless community and is a key technology for improving the coverage of wireless network deployments. Today, a wireless deployment is evaluated by a site survey which is a cumbersome process requiring an experienced engineer to perform extensive channel measurements. To reduce the cost of site surveys, we develop NeWRF, which is based on recent advances in Neural Radiance Fields (NeRF). NeWRF trains a neural network model with a sparse set of channel measurements, and predicts the wireless channel accurately in any location in the site. We introduce a series of techniques that integrate wireless propagation properties into the NeWRF framework to account for the fundamental differences between the behavior of light and wireless signals. We conduct extensive evaluations of our framework and show that our approach can accurately predict channels at unvisited locations with significantly lower measurement density than the prior state-of-the-art.
@article{lu2024newrf, title = {NeWRF: A Deep Learning Framework for Wireless Radiation Field Reconstruction and Channel Prediction}, author = {Lu, Haofan and Vattheuer, Christopher and Mirzasoleiman, Baharan and Abari, Omid}, journal = {International Conference on Machine Learning (ICML)}, year = {2024} }
UAI
Graph Contrastive Learning under Heterophily via Graph Filters

Wenhan Yang and Baharan Mirzasoleiman

Conference on Uncertainty in Artificial Intelligence (UAI), 2024

Abs Bib PDF Code

Graph contrastive learning (CL) methods learn node representations in a self-supervised manner by maximizing the similarity between the augmented node representations obtained via a GNN-based encoder. However, CL methods perform poorly on graphs with heterophily, where connected nodes tend to belong to different classes. In this work, we address this problem by proposing an effective graph CL method, namely HLCL, for learning graph representations under heterophily. HLCL first identifies a homophilic and a heterophilic subgraph based on the cosine similarity of node features. It then uses a low-pass and a high-pass graph filter to aggregate representations of nodes connected in the homophilic subgraph and differentiate representations of nodes in the heterophilic subgraph. The final node representations are learned by contrasting both the augmented high-pass filtered views and the augmented low-pass filtered node views. Our extensive experiments show that HLCL outperforms state-of-the-art graph CL methods on benchmark datasets with heterophily, as well as large-scale real-world graphs, by up to 7%, and outperforms graph supervised learning methods on datasets with heterophily by up to 10%.
@article{yang2023contrastive, title = {Graph Contrastive Learning under Heterophily via Graph Filters}, author = {Yang, Wenhan and Mirzasoleiman, Baharan}, journal = {Conference on Uncertainty in Artificial Intelligence (UAI)}, year = {2024} }
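To make the low-pass and high-pass filtering in the HLCL abstract above concrete, here is a minimal sketch of one common pair of graph filters built from the row-normalized adjacency P: (I + P)/2 smooths a node toward its neighbors, and (I - P)/2 keeps the difference from them. This particular filter choice and the helper names are assumptions for illustration. The abstract does not specify which filters HLCL uses.

```python
def neighbor_mean(features, adjacency):
    """Apply the row-normalized adjacency P: average each node's neighbors."""
    out = []
    for i, nbrs in enumerate(adjacency):
        if not nbrs:
            out.append(list(features[i]))  # isolated node: fall back to itself
            continue
        dim = len(features[i])
        out.append([sum(features[j][d] for j in nbrs) / len(nbrs)
                    for d in range(dim)])
    return out

def low_pass(features, adjacency):
    """(I + P)/2: pulls a node's features toward its neighbors' average."""
    px = neighbor_mean(features, adjacency)
    return [[(x + p) / 2 for x, p in zip(xi, pi)]
            for xi, pi in zip(features, px)]

def high_pass(features, adjacency):
    """(I - P)/2: keeps how a node differs from its neighbors' average."""
    px = neighbor_mean(features, adjacency)
    return [[(x - p) / 2 for x, p in zip(xi, pi)]
            for xi, pi in zip(features, px)]

# Two connected nodes with scalar features 1.0 and 3.0.
features = [[1.0], [3.0]]
adjacency = [[1], [0]]  # adjacency[i] lists the neighbors of node i
smooth = low_pass(features, adjacency)
sharp = high_pass(features, adjacency)
```

In this toy graph the low-pass view makes the two nodes identical, while the high-pass view separates them, which is the behavior one would want on homophilic and heterophilic edges respectively.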
UAI
Investigating the Impact of Model Width and Density on Generalization in Presence of Label Noise

Yihao Xue, Kyle Whitecross, and Baharan Mirzasoleiman

Conference on Uncertainty in Artificial Intelligence (UAI), 2024

Spotlight presentation

Abs Bib PDF Supp

Increasing the size of overparameterized neural networks has been key to achieving state-of-the-art performance. This is captured by the double descent phenomenon, where the test loss follows a decreasing-increasing-decreasing pattern (or sometimes monotonically decreasing) as model width increases. However, the effect of label noise on the test loss curve has not been fully explored. In this work, we uncover an intriguing phenomenon where label noise leads to a final ascent in the originally observed double descent curve. Specifically, under a sufficiently large noise-to-sample-size ratio, optimal generalization is achieved at intermediate widths. Through theoretical analysis, we attribute this phenomenon to the shape transition of test loss variance induced by label noise. Furthermore, we extend the final ascent phenomenon to model density and provide the first theoretical characterization showing that reducing density by randomly dropping trainable parameters improves generalization under label noise. We also thoroughly examine the roles of regularization and sample size. Surprisingly, we find that larger l2 regularization and robust learning methods against label noise exacerbate the final ascent. We confirm the validity of our findings through extensive experiments on ReLU networks trained on MNIST, ResNets/ViTs trained on CIFAR-10/100, and InceptionResNet-v2 trained on Stanford Cars with real-world noisy labels.
@article{xue2024final, title = {Investigating the Impact of Model Width and Density on Generalization in Presence of Label Noise}, author = {Xue, Yihao and Whitecross, Kyle and Mirzasoleiman, Baharan}, journal = {Conference on Uncertainty in Artificial Intelligence (UAI)}, year = {2024}, award = {Spotlight presentation}, noise = {true} }
AISTATS
Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Siddharth Joshi, Arnav Jain, Ali Payani, and Baharan Mirzasoleiman

International Conference on Artificial Intelligence and Statistics (AISTATS), 2024

Abs Bib PDF Supp

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP’s performance than increasing its volume. Nevertheless, finding a subset of image-caption pairs that provably generalizes on par with the full data when trained on has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that best preserve the cross-covariance of the images and captions of the full data best preserve CLIP’s generalization performance. Our extensive experiments on ConceptualCaptions3M demonstrate that subsets of size 5%-10% found by ClipCov over 150% and 40% the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subset exhibits average relative performance improvement over the next best baseline of nearly 50% across 14 downstream datasets.
@article{joshi2024data, title = {Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity}, author = {Joshi, Siddharth and Jain, Arnav and Payani, Ali and Mirzasoleiman, Baharan}, journal = {International Conference on Artificial Intelligence and Statistics (AISTATS)}, year = {2024}, efficient = {true} }
AISTATS
Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias

Yu Yang, Eric Gan, Gintare Karolina Dziugaite, and Baharan Mirzasoleiman

International Conference on Artificial Intelligence and Statistics (AISTATS), 2024

Abs Bib PDF Supp Code

Neural networks trained with (stochastic) gradient descent have an inductive bias towards learning simpler solutions. This makes them highly prone to learning spurious correlations in the training data that may not hold at test time. In this work, we provide the first theoretical analysis of the effect of simplicity bias on learning spurious correlations. Notably, we show that examples with spurious features are provably separable based on the model’s output early in training. We further illustrate that if spurious features have a small enough noise-to-signal ratio, the network’s output on the majority of examples is almost exclusively determined by the spurious features, leading to poor worst-group test accuracy. Finally, we propose Spare, which identifies spurious correlations early in training, and utilizes importance sampling to alleviate their effect. Empirically, we demonstrate that Spare outperforms state-of-the-art methods by up to 21.1% in worst-group accuracy, while being up to 12x faster. We also show that Spare is a highly effective but lightweight method to discover spurious correlations. Code is available at https://github.com/BigML-CS-UCLA/SPARE.
@article{yang2024identifying, title = {Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias}, author = {Yang, Yu and Gan, Eric and Dziugaite, Gintare Karolina and Mirzasoleiman, Baharan}, journal = {International Conference on Artificial Intelligence and Statistics (AISTATS)}, year = {2024}, spurious = {true} }
ICLR
Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift

Yihao Xue, Siddharth Joshi, Dang Nguyen, and Baharan Mirzasoleiman

International Conference on Learning Representations (ICLR), 2024

Abs Bib PDF Supp Website

Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP (Radford et al., 2021), have achieved remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite the empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL’s robustness: intra-class contrasting, which allows the model to learn features with a high variance, and inter-class feature sharing, where annotated details in one class help learning other classes better. Both mechanisms prevent spurious features that are over-represented in the training data from overshadowing the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment involving training CLIP models on MSCOCO (Lin et al., 2014)/Conceptual Captions (Sharma et al., 2018) and evaluating them on shifted ImageNets.
@article{xue2024robustness, title = {Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift}, author = {Xue, Yihao and Joshi, Siddharth and Nguyen, Dang and Mirzasoleiman, Baharan}, journal = {International Conference on Learning Representations (ICLR)}, year = {2024}, spurious = {true} }
ICLR
Investigating the Benefits of Projection Head for Representation Learning

Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi, and Baharan Mirzasoleiman

International Conference on Learning Representations (ICLR), 2024

Abs Bib PDF Supp Website

An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves the robustness in supervised contrastive learning and supervised learning. We empirically validate our results through various experiments on CIFAR-10/100, UrbanCars and shifted versions of ImageNet. We also introduce a potential alternative to projection head, which offers a more interpretable and controllable design.
@article{xue2024projection, title = {Investigating the Benefits of Projection Head for Representation Learning}, author = {Xue, Yihao and Gan, Eric and Ni, Jiayi and Joshi, Siddharth and Mirzasoleiman, Baharan}, journal = {International Conference on Learning Representations (ICLR)}, year = {2024}, spurious = {true} }
ICLR
Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality

Xuxi Chen*, Yu Yang*, Zhangyang Wang, and Baharan Mirzasoleiman

International Conference on Learning Representations (ICLR), 2024

Abs Bib PDF Supp Code

Dataset distillation aims to minimize the time and memory needed for training deep networks on large datasets, by creating a small set of synthetic images that has a similar generalization performance to that of the full dataset. However, current dataset distillation techniques fall short, showing a notable performance gap compared to training on the original data. In this work, we are the first to argue that the use of only one synthetic subset for distillation may not yield optimal generalization performance. This is because the training dynamics of deep networks change drastically during training. Therefore, multiple synthetic subsets are required to capture the dynamics of training in different stages. To address this issue, we propose Progressive Dataset Distillation (PDD). PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets without requiring additional training time. Our extensive experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%. In addition, our method for the first time enables generating considerably larger synthetic datasets. Our code is available at https://github.com/VITA-Group/ProgressiveDD.
@article{chen2024distillation, title = {Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality}, author = {Chen*, Xuxi and Yang*, Yu and Wang, Zhangyang and Mirzasoleiman, Baharan}, journal = {International Conference on Learning Representations (ICLR)}, year = {2024}, efficient = {true} }

2023

NeurIPS
Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks

Wenhan Yang, Jingdong Gao, and Baharan Mirzasoleiman

Advances in Neural Information Processing Systems (NeurIPS), 2023

Abs Bib PDF Supp Code

Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification, by learning from millions of image-caption pairs crawled from the internet. However, the massive data that powers large multimodal models such as CLIP makes them extremely vulnerable to various types of targeted data poisoning and backdoor attacks. Despite this vulnerability, robust contrastive vision-language pre-training against such attacks has remained unaddressed. In this work, we propose ROCLIP, the first effective method for robust pre-training multimodal vision-language models against targeted data poisoning and backdoor attacks. ROCLIP effectively breaks the association between poisoned image-caption pairs by considering a relatively large and varying pool of random captions, and matching every image with the text that is most similar to it in the pool instead of its own caption, every few epochs. It also leverages image and text augmentations to further strengthen the defense and improve the performance of the model. Our extensive experiments show that ROCLIP renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training CLIP models. In particular, ROCLIP decreases the success rate for targeted data poisoning attacks from 93.75% to 12.5% and that of backdoor attacks down to 0%, while improving the model’s linear probe performance by 10% and maintaining a similar zero-shot performance compared to CLIP. By increasing the frequency of matching, ROCLIP is able to defend against strong attacks, which add up to 1% poisoned examples to the data, and successfully maintain a low attack success rate of 12.5%, while trading off the performance on some tasks.
@article{yang2023robust, title = {Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks}, author = {Yang, Wenhan and Gao, Jingdong and Mirzasoleiman, Baharan}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2023}, poison = {true} }
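The pool-matching step described in the ROCLIP abstract above (pairing every image with the most similar caption in a pool rather than its own caption) can be sketched as a nearest-neighbor lookup under cosine similarity. This is a toy illustration with hand-written embeddings. The function names are hypothetical, and how the pool is sampled and refreshed is not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_to_pool(image_embs, pool_caption_embs):
    """For each image, return the index of the most similar pool caption."""
    return [
        max(range(len(pool_caption_embs)),
            key=lambda j: cosine(img, pool_caption_embs[j]))
        for img in image_embs
    ]

images = [[1.0, 0.0], [0.0, 1.0]]
pool = [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]]
matches = match_to_pool(images, pool)
```

During the matching epochs, the contrastive loss would be computed against `pool[matches[i]]` instead of image `i`'s original caption, so a poisoned caption is unlikely to stay attached to its poisoned image.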
NeurIPS
Robust Learning with Progressive Data Expansion Against Spurious Correlation

Yihe Deng*, Yu Yang*, Baharan Mirzasoleiman, and Quanquan Gu

Advances in Neural Information Processing Systems (NeurIPS), 2023

Abs Bib PDF Supp

While deep learning models have shown remarkable performance in various tasks, they are susceptible to learning non-generalizable spurious features rather than the core features that are genuinely correlated to the true label. In this paper, beyond existing analyses of linear models, we theoretically examine the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. In light of this, we propose a new training algorithm called PDE that efficiently enhances the model’s robustness for a better worst-group performance. PDE begins with a group-balanced subset of training data and progressively expands it to facilitate the learning of the core features. Experiments on synthetic and real-world benchmark datasets confirm the superior performance of our method on models such as ResNets and Transformers. On average, our method achieves a 2.8% improvement in worst-group accuracy compared with the state-of-the-art method, while enjoying up to 10× faster training efficiency.
@article{deng2023robust, title = {Robust Learning with Progressive Data Expansion Against Spurious Correlation}, author = {Deng*, Yihe and Yang*, Yu and Mirzasoleiman, Baharan and Gu, Quanquan}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2023}, spurious = {true} }
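The data schedule described in the PDE abstract above (start from a group-balanced subset, then expand it progressively) can be sketched as follows. The sketch assumes the warm-up subset takes an equal number of examples from every group and that each expansion adds a fixed number of randomly chosen remaining examples. The function name and the `num_stages` and `per_stage` parameters are illustrative assumptions, not the paper's settings.

```python
import random
from collections import defaultdict

def expansion_schedule(groups, num_stages, per_stage, seed=0):
    """Return the list of training-index subsets used at each stage.

    groups: group id of every training example.
    Stage 0 is group-balanced; each later stage adds `per_stage` new examples.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    # Warm-up subset: as many examples per group as the smallest group has.
    n_min = min(len(members) for members in by_group.values())
    current = []
    for g in sorted(by_group):
        current.extend(rng.sample(by_group[g], n_min))

    chosen = set(current)
    remaining = [i for i in range(len(groups)) if i not in chosen]
    rng.shuffle(remaining)

    stages = [list(current)]
    for _ in range(num_stages - 1):
        new, remaining = remaining[:per_stage], remaining[per_stage:]
        current = current + new  # expand the training set
        stages.append(list(current))
    return stages

groups = [0] * 8 + [1] * 2  # an imbalanced toy dataset with two groups
stages = expansion_schedule(groups, num_stages=3, per_stage=2)
```

The model would be trained on `stages[0]` first and then on each larger subset in turn.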
J. Affect. Disord.
Sleep, Brain Systems, and Persistent Stress in Early Adolescents During COVID-19: Insights from the ABCD Study

Orsolya Kiss, Zihan Qu, Eva M. Müller-Oehring, Fiona C. Baker, and Baharan Mirzasoleiman

Journal of Affective Disorders, 2023

Abs Bib PDF

Purpose: The first year of the COVID-19 pandemic constituted a major life stress event for many adolescents, associated with disrupted school, behaviors, social networks, and health concerns. However, pandemic-related stress was not equivalent for everyone and could have been influenced by pre-pandemic factors including brain structure and sleep, which both undergo substantial development during adolescence. Here, we analyzed clusters of perceived stress levels across the pandemic and determined developmentally relevant pre-pandemic risk factors in brain structure and sleep of persistently high stress during the first year of the COVID-19 pandemic. Methods: We investigated longitudinal changes in perceived stress at six timepoints across the first year of the pandemic (May 2020–March 2021) in 5559 adolescents (50% female; age range: 11–14 years) in the United States (U.S.) participating in the Adolescent Brain Cognitive Development (ABCD) study. In 3141 of these adolescents, we fitted machine learning models to identify the most important pre-pandemic predictors from structural MRI brain measures and self-reported sleep data that were associated with persistently high stress across the first year of the pandemic. Results: Patterns of perceived stress levels varied across the pandemic, with 5% reporting persistently high stress. Our classifiers accurately detected persistently high stress (AUC > 0.7). Pre-pandemic brain structure, specifically cortical volume in temporal regions, and cortical thickness in multiple parietal and occipital regions, predicted persistent stress. Pre-pandemic sleep difficulties and short sleep duration were also strong predictors of persistent stress, along with more advanced pubertal stage. Conclusions: Adolescents showed variable stress responses during the first year of the COVID-19 pandemic, and some reported persistently high stress across the whole first year. Vulnerability to persistent stress was evident in several brain structural and self-reported sleep measures, collected before the pandemic, suggesting the relevance of other pre-existing individual factors beyond pandemic-related factors, for persistently high stress responses.
@article{kiss2023sleep, title = {Sleep, Brain Systems, and Persistent Stress in Early Adolescents During COVID-19: Insights from the ABCD Study}, author = {Kiss, Orsolya and Qu, Zihan and Müller-Oehring, Eva M. and Baker, Fiona C. and Mirzasoleiman, Baharan}, journal = {Journal of Affective Disorders}, year = {2023} }
ICML
Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning

Yu Yang, Besmira Nushi, Hamid Palangi, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2023

Abs Bib PDF Supp

Spurious correlations that degrade model generalization or lead the model to be right for the wrong reasons are one of the main robustness concerns for real-world deployments. However, mitigating these correlations during pre-training for large-scale models can be costly and impractical, particularly for those without access to high-performance computing resources. This paper proposes a novel approach to address spurious correlations during fine-tuning for a given domain of interest. With a focus on multi-modal models (e.g., CLIP), the proposed method leverages different modalities in these models to detect and explicitly set apart spurious attributes from the affected class, achieved through a multi-modal contrastive loss function that expresses spurious relationships through language. Our experimental results and in-depth visualizations on CLIP show that such an intervention can effectively i) improve the model’s accuracy when spurious attributes are not present, and ii) direct the model’s activation maps towards the actual class rather than the spurious attribute when present. In particular, on the Waterbirds dataset, our algorithm achieved a worst-group accuracy 23% higher than ERM on CLIP with a ResNet-50 backbone, and 32% higher on CLIP with a ViT backbone, while maintaining the same average accuracy as ERM.
@article{yang2023mitigating, title = {Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning}, author = {Yang, Yu and Nushi, Besmira and Palangi, Hamid and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2023}, spurious = {true} }
ICML
Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least

Siddharth Joshi, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2023

Abs Bib PDF Supp Code Website

Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data. As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations. This enables efficient SSL by reducing the volume of data required for learning high-quality representations. Nevertheless, quantifying the value of examples for SSL has remained an open question. In this work, we address this for the first time, by proving that examples that contribute the most to contrastive SSL are those that have the most similar augmentations to other examples, in expectation. We provide rigorous guarantees for the generalization performance of SSL on such subsets. Empirically, we discover, perhaps surprisingly, the subsets that contribute the most to SSL are those that contribute the least to supervised learning. Through extensive experiments, we show that our subsets outperform random subsets by more than 3% on CIFAR100, CIFAR10, and STL10. Interestingly, we also find that we can safely exclude 20% of examples from CIFAR100 and 40% from STL10, without affecting downstream task performance.
@article{joshi2023data, title = {Data-Efficient Contrastive Self-supervised Learning: Most Beneficial Examples for Supervised Learning Contribute the Least}, author = {Joshi, Siddharth and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2023}, efficient = {true} }
ICML
Which Features are Learned by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression

Yihao Xue, Siddharth Joshi, Eric Gan, Pin-Yu Chen, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2023

Oral presentation (top 2%)

Abs Bib PDF Supp Website

Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. However, supervised CL is prone to collapsing representations of subclasses within a class by not capturing all their features, and unsupervised CL may suppress harder class-relevant features by focusing on learning easy class-irrelevant features; both significantly compromise representation quality. Yet, there is no theoretical understanding of class collapse or feature suppression at test time. We provide the first unified theoretically rigorous framework to determine which features are learnt by CL. Our analysis indicates that, perhaps surprisingly, bias of (stochastic) gradient descent towards finding simpler solutions is a key factor in collapsing subclass representations and suppressing harder class-relevant features. Moreover, we present increasing embedding dimensionality and improving the quality of data augmentations as two theoretically motivated solutions to feature suppression. We also provide the first theoretical explanation for why employing supervised and unsupervised CL together yields higher-quality representations, even when using commonly-used stochastic gradient methods.
@article{xue2023which, title = {Which Features are Learned by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression}, author = {Xue, Yihao and Joshi, Siddharth and Gan, Eric and Chen, Pin-Yu and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2023}, award = {Oral presentation (top 2%)} }
ICML
Towards Sustainable Learning: Coresets for Data-efficient Deep Learning

Yu Yang, Hao Kang, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2023

Abs Bib PDF Supp Website

To improve the efficiency and sustainability of learning deep models, we propose CREST, the first scalable framework with rigorous theoretical guarantees to identify the most valuable examples for training non-convex models, particularly deep networks. To guarantee convergence to a stationary point of a non-convex function, CREST models the non-convex loss as a series of quadratic functions and extracts a coreset for each quadratic sub-region. In addition, to ensure faster convergence of stochastic gradient methods such as (mini-batch) SGD, CREST iteratively extracts multiple mini-batch coresets from larger random subsets of training data, to ensure nearly-unbiased gradients with small variances. Finally, to further improve scalability and efficiency, CREST identifies and excludes the examples that are learned from the coreset selection pipeline. Our extensive experiments on several deep networks trained on vision and NLP datasets, including CIFAR-10, CIFAR-100, TinyImageNet, and SNLI, confirm that CREST speeds up training deep networks on very large datasets, by 1.7x to 2.5x with minimum loss in the performance. By analyzing the learning difficulty of the subsets selected by CREST, we show that deep models benefit the most by learning from subsets of increasing difficulty levels.
@article{yang2023towards, title = {Towards Sustainable Learning: Coresets for Data-efficient Deep Learning}, author = {Yang, Yu and Kang, Hao and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, year = {2023}, efficient = {true} }
HotStorage
NeSSA: Near-Storage Data Selection for Accelerated Machine Learning Training

Neha Prakriya, Yu Yang, Baharan Mirzasoleiman, Cho-Jui Hsieh, and Jason Cong

ACM Workshop on Hot Topics in Storage and File Systems (HotStorage), 2023

Abs Bib PDF

Large-scale machine learning (ML) models rely on extremely large datasets to learn their exponentially growing number of parameters. While these models achieve unprecedented success, the increase in training time and hardware resources required is unsustainable. Further, we find that as dataset sizes increase, data movement becomes a significant component of overall training time. We propose NeSSA, a novel SmartSSD+GPU training architecture to intelligently select important subsets of large datasets near-storage, such that training on the subset mimics training on the full dataset with a very small loss in accuracy. To the best of our knowledge, this is the first work to propose such a near-storage data selection model for efficient ML training. We have evaluated our method for the CIFAR-10, SVHN, CINIC-10, CIFAR-100, TinyImageNet, and ImageNet-100 datasets. We also test across ResNet-20, ResNet-18, and ResNet-50 models.
@article{prakriya23nessa, title = {NeSSA: Near-Storage Data Selection for Accelerated Machine Learning Training}, author = {Prakriya, Neha and Yang, Yu and Mirzasoleiman, Baharan and Hsieh, Cho-Jui and Cong, Jason}, journal = {ACM Workshop on Hot Topics in Storage and File Systems (HotStorage)}, year = {2023}, efficient = {true} }
AISTATS
High Probability Bounds for Stochastic Continuous Submodular Maximization

Evan Becker, Jingdong Gao, Ted Zadouri, and Baharan Mirzasoleiman

International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

Abs Bib PDF Supp

We consider maximization of stochastic monotone continuous submodular functions (CSF) with a diminishing return property. Existing algorithms only guarantee the performance in expectation, and do not bound the probability of getting a bad solution. This implies that for a particular run of the algorithms, the solution may be much worse than the provided guarantee in expectation. In this paper, we first empirically verify that this is indeed the case. Then, we provide the first high-probability analysis of the existing methods for stochastic CSF maximization, namely PGA, boosted PGA, SCG, and SCG++. Finally, we provide an improved high-probability bound for SCG, under slightly stronger assumptions, with a better convergence rate than that of the expected solution. Through extensive experiments on non-concave quadratic programming (NQP) and optimal budget allocation, we confirm the validity of our bounds and show that even in the worst-case, PGA converges to OPT/2, and boosted PGA, SCG, SCG++ converge to (1-1/e)OPT, but at a slower rate than that of the expected solution.
@article{becker2023high, title = {High Probability Bounds for Stochastic Continuous Submodular Maximization}, author = {Becker, Evan and Gao, Jingdong and Zadouri, Ted and Mirzasoleiman, Baharan}, journal = {International Conference on Artificial Intelligence and Statistics (AISTATS)}, pages = {5958--5979}, year = {2023}, organization = {PMLR}, }
ICDH
A Self-supervised Framework for Improved Data-Driven Monitoring of Stress via Multi-modal Passive Sensing

Shayan Fazeli, Lionel Levine, Mehrab Beikzadeh, Baharan Mirzasoleiman, Bita Zadeh, Tara Peris, and Majid Sarrafzadeh

IEEE Conference on Digital Health (ICDH), 2023

Abs Bib PDF Code

Recent advances in remote health monitoring systems have significantly benefited patients and played a crucial role in improving their quality of life. However, while physiological health-focused solutions have demonstrated increasing success and maturity, mental health-focused applications have seen comparatively limited success in spite of the fact that stress and anxiety disorders are among the most common issues people deal with in their daily lives. In the hopes of furthering progress in this domain through the development of a more robust analytic framework for the measurement of indicators of mental health, we propose a multi-modal semi-supervised framework for tracking physiological precursors of the stress response. Our methodology enables utilizing multi-modal data of differing domains and resolutions from wearable devices and leveraging them to map short-term episodes to semantically efficient embeddings for a given task. Additionally, we leverage an inter-modality contrastive objective, with the advantages of rendering our framework both modular and scalable. The focus on optimizing both local and global aspects of our embeddings via a hierarchical structure renders transferring knowledge and compatibility with other devices easier to achieve. In our pipeline, a task-specific pooling based on an attention mechanism, which estimates the contribution of each modality on an instance level, computes the final embeddings for observations. This additionally provides a thorough diagnostic insight into the data characteristics and highlights the importance of signals in the broader view of predicting episodes annotated per mental health status. We perform training experiments using a corpus of real-world data on perceived stress, and our results demonstrate the efficacy of the proposed approach in performance improvements.
@article{fazeli2023self, title = {A Self-supervised Framework for Improved Data-Driven Monitoring of Stress via Multi-modal Passive Sensing}, author = {Fazeli, Shayan and Levine, Lionel and Beikzadeh, Mehrab and Mirzasoleiman, Baharan and Zadeh, Bita and Peris, Tara and Sarrafzadeh, Majid}, journal = {IEEE Conference on Digital Health (ICDH)}, year = {2023}, }
TKDE
On the fairness of time-critical influence maximization in social networks

Junaid Ali, Mahmoudreza Babaei, Abhijnan Chakraborty, Baharan Mirzasoleiman, Krishna Gummadi, and Adish Singla

IEEE Transactions on Knowledge and Data Engineering (TKDE), 2023

Abs Bib PDF

Influence maximization has found applications in a wide range of real-world problems, for instance, viral marketing of products in an online social network, and propagation of valuable information such as job vacancy advertisements. While existing algorithmic techniques usually aim at maximizing the total number of people influenced, the population often comprises several socially salient groups, e.g., based on gender or race. As a result, these techniques could lead to disparity across different groups in receiving important information. Furthermore, in many applications, the spread of influence is time-critical, i.e., it is only beneficial to be influenced before a deadline. As we show in this paper, such time-criticality of information could further exacerbate the disparity of influence across groups. This disparity could have far-reaching consequences, impacting people’s prosperity and putting minority groups at a big disadvantage. In this work, we propose a notion of group fairness in time-critical influence maximization. We introduce surrogate objective functions to solve the influence maximization problem under fairness considerations. By exploiting the submodularity structure of our objectives, we provide computationally efficient algorithms with guarantees that are effective in enforcing fairness during the propagation process. Extensive experiments on synthetic and real-world datasets demonstrate the efficacy of our proposal.
@article{ali2023fairness, title = {On the fairness of time-critical influence maximization in social networks}, author = {Ali, Junaid and Babaei, Mahmoudreza and Chakraborty, Abhijnan and Mirzasoleiman, Baharan and Gummadi, Krishna and Singla, Adish}, journal = {IEEE Transactions on Knowledge and Data Engineering (TKDE)}, year = {2023}, publisher = {IEEE}, }

2022

NeurIPS
Friendly Noise against Adversarial Noise: A Powerful Defense against Data Poisoning Attack

Tian Yu Liu, Yu Yang, and Baharan Mirzasoleiman

Advances in Neural Information Processing Systems (NeurIPS), 2022

Abs Bib PDF Supp Code

A powerful category of (invisible) data poisoning attacks modifies a subset of training examples by small adversarial perturbations to change the prediction of certain test-time data. Existing defense mechanisms are not desirable to deploy in practice, as they often either drastically harm the generalization performance, or are attack-specific, and prohibitively slow to apply. Here, we propose a simple but highly effective approach that unlike existing methods breaks various types of invisible poisoning attacks with the slightest drop in the generalization performance. We make the key observation that attacks introduce local sharp regions of high training loss, which, when minimized, result in learning the adversarial perturbations and make the attack successful. To break poisoning attacks, our key idea is to alleviate the sharp loss regions introduced by poisons. To do so, our approach comprises two components: an optimized friendly noise that is generated to maximally perturb examples without degrading the performance, and a randomly varying noise component. The combination of both components builds a very light-weight but extremely effective defense against the most powerful triggerless targeted and hidden-trigger backdoor poisoning attacks, including Gradient Matching, Bulls-eye Polytope, and Sleeper Agent. We show that our friendly noise is transferable to other architectures, and adaptive attacks cannot break our defense due to its random noise component.
@article{liufriendly, title = {Friendly Noise against Adversarial Noise: A Powerful Defense against Data Poisoning Attack}, author = {Liu, Tian Yu and Yang, Yu and Mirzasoleiman, Baharan}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2022}, poison = {true} }
NeurIPS
Data-Efficient Augmentation for Training Neural Networks

Tian Yu Liu, and Baharan Mirzasoleiman

Advances in Neural Information Processing Systems (NeurIPS), 2022

Abs Bib PDF Supp Code

Data augmentation is essential to achieve state-of-the-art performance in many deep learning applications. However, the most effective augmentation techniques become computationally prohibitive for even medium-sized datasets. To address this, we propose a rigorous technique to select subsets of data points that when augmented, closely capture the training dynamics of full data augmentation. We first show that data augmentation, modeled as additive perturbations, improves learning and generalization by relatively enlarging and perturbing the smaller singular values of the network Jacobian, while preserving its prominent directions. This prevents overfitting and enhances learning the harder to learn information. Then, we propose a framework to iteratively extract small subsets of training data that when augmented, closely capture the alignment of the fully augmented Jacobian with labels/residuals. We prove that stochastic gradient descent applied to the augmented subsets found by our approach has similar training dynamics to that of fully augmented data. Our experiments demonstrate that our method achieves 6.3x speedup on CIFAR10 and 2.2x speedup on SVHN, and outperforms the baselines by up to 10% across various subset sizes. Similarly, on TinyImageNet and ImageNet, our method beats the baselines by up to 8%, while achieving up to 3.3x speedup across various subset sizes. Finally, training on and augmenting 50% subsets using our method on a version of CIFAR10 corrupted with label noise even outperforms using the full dataset.
@article{liudata, title = {Data-Efficient Augmentation for Training Neural Networks}, author = {Liu, Tian Yu and Mirzasoleiman, Baharan}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, year = {2022}, efficient = {true} }
ICML
Not all poisons are created equal: Robust training against data poisoning

Yu Yang, Tian Yu Liu, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2022

Oral presentation (top 2%)

Abs Bib PDF Supp Code

Data poisoning causes misclassification of test time target examples, by injecting maliciously crafted samples in the training data. Existing defenses are often effective only against a specific type of targeted attack, significantly degrade the generalization performance, or are prohibitive for standard deep learning pipelines. In this work, we propose an efficient defense mechanism that significantly reduces the success rate of various data poisoning attacks, and provides theoretical guarantees for the performance of the model. Targeted attacks work by adding bounded perturbations to a randomly selected subset of training data to match the targets’ gradient or representation. We show that: (i) under bounded perturbations, only a number of poisons can be optimized to have a gradient that is close enough to that of the target and make the attack successful; (ii) such effective poisons move away from their original class and get isolated in the gradient space; (iii) dropping examples in low-density gradient regions during training can successfully eliminate the effective poisons, and guarantees similar training dynamics to that of training on full data. Our extensive experiments show that our method significantly decreases the success rate of state-of-the-art targeted attacks, including Gradient Matching and Bullseye Polytope, and easily scales to large datasets.
@article{yang2022not, title = {Not all poisons are created equal: Robust training against data poisoning}, author = {Yang, Yu and Liu, Tian Yu and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, pages = {25154--25165}, year = {2022}, organization = {PMLR}, poison = {true}, award = {Oral presentation (top 2%)} }
ICML
Adaptive second order coresets for data-efficient machine learning

Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2022

Abs Bib PDF Supp

Training machine learning models on massive datasets incurs substantial computational costs. To alleviate such costs, there has been a sustained effort to develop data-efficient training methods that can carefully select subsets of the training examples that generalize on par with the full training data. However, existing methods are limited in providing theoretical guarantees for the quality of the models trained on the extracted subsets, and may perform poorly in practice. We propose AdaCore, a method that leverages the geometry of the data to extract subsets of the training examples for efficient machine learning. The key idea behind our method is to dynamically approximate the curvature of the loss function via an exponentially-averaged estimate of the Hessian to select weighted subsets (coresets) that provide a close approximation of the full gradient preconditioned with the Hessian. We prove rigorous guarantees for the convergence of various first and second-order methods applied to the subsets chosen by AdaCore. Our extensive experiments show that AdaCore extracts coresets with higher quality compared to baselines and speeds up training of convex and non-convex machine learning models, such as logistic regression and neural networks, by over 2.9x over the full data and 4.5x over random subsets.
@article{pooladzandi2022adaptive, title = {Adaptive second order coresets for data-efficient machine learning}, author = {Pooladzandi, Omead and Davini, David and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, pages = {17848--17869}, year = {2022}, organization = {PMLR}, efficient = {true} }
ICML
Investigating why contrastive learning benefits robustness against label noise

Yihao Xue, Kyle Whitecross, and Baharan Mirzasoleiman

International Conference on Machine Learning (ICML), 2022

Abs Bib PDF Supp

Self-supervised Contrastive Learning (CL) has been recently shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness, by having: (i) one prominent singular value corresponding to each sub-class in the data, and significantly smaller remaining singular values; and (ii) a large alignment between the prominent singular vectors and the clean labels of each sub-class. The above properties enable a linear layer trained on such representations to effectively learn the clean labels without overfitting the noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve a superior performance initially, when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels, e.g., an average of 27.18% and 15.58% increase in accuracy on CIFAR-10 and CIFAR-100 with 80% symmetric noisy labels, and 4.11% increase in accuracy on WebVision.
@article{xue2022investigating, title = {Investigating why contrastive learning benefits robustness against label noise}, author = {Xue, Yihao and Whitecross, Kyle and Mirzasoleiman, Baharan}, journal = {International Conference on Machine Learning (ICML)}, pages = {24851--24871}, year = {2022}, organization = {PMLR}, noise = {true} }
Syn.Data4ML
Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization

Omead Pooladzandi, Pasha Khosravi, Erik Nijkamp, and Baharan Mirzasoleiman

NeurIPS SyntheticData4ML Workshop, 2022

Abs Bib PDF

Generative models have the ability to synthesize data points drawn from the data distribution; however, not all generated samples are high quality. In this paper, we propose using a combination of coreset selection methods and “entropic regularization” to select the highest fidelity samples. We leverage an Energy-Based Model which resembles a variational auto-encoder with an inference and generator model for which the latent prior is complexified by an energy-based model. In a semi-supervised learning scenario, we show that augmenting the labeled dataset by adding our selected subset of samples leads to a better accuracy improvement than using all the synthetic samples.
@article{pooladzandi2023generating, title = {Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization}, author = {Pooladzandi, Omead and Khosravi, Pasha and Nijkamp, Erik and Mirzasoleiman, Baharan}, journal = {NeurIPS SyntheticData4ML Workshop}, year = {2022}, }
BIBM
Passive Monitoring of Physiological Precursors of Stress Leveraging Smartwatch Data

Shayan Fazeli, Lionel Levine, Mehrab Beikzadeh, Baharan Mirzasoleiman, Bita Zadeh, Tara Peris, and Majid Sarrafzadeh

IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2022

Abs Bib PDF Code

Developing the capability to continuously and noninvasively monitor the mental health status of individuals is a critical focus in the mHealth domain. The use of passively generated data gathered via smart and portable electronic devices to monitor specific indicators of mental health has shown potential to serve as an effective alternative to traditional intrusive survey-based approaches to monitoring mental health remotely. In this study, we propose a remote health monitoring framework for dynamic, flexible, and scalable assessment and detection of physiological precursors of a stress response. Our method comprises a smartwatch-based system for continuous monitoring of primary physiological signals, followed by a deep neural network architecture that performs the fusion and processing of the multi-modal sensor readings. We empirically validate our system on a cohort of university-affiliated members of the military. Our findings demonstrate the effectiveness of our passive-sensing system for tracking perceived stress, the results of which can be used to obtain a better understanding of patient behavior and improve personalized treatments.
@article{fazeli2022passive, title = {Passive Monitoring of Physiological Precursors of Stress Leveraging Smartwatch Data}, author = {Fazeli, Shayan and Levine, Lionel and Beikzadeh, Mehrab and Mirzasoleiman, Baharan and Zadeh, Bita and Peris, Tara and Sarrafzadeh, Majid}, journal = {IEEE International Conference on Bioinformatics and Biomedicine (BIBM)}, pages = {2893--2899}, year = {2022}, organization = {IEEE}, }
EAAMO
Towards Balanced Information Propagation in Social Media

Mahmoudreza Babaei, Baharan Mirzasoleiman, Jungseock Joo, and Adrian Weller

ACM conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO), 2022

Abs Bib PDF

As people increasingly rely on social media platforms such as Twitter to consume information, there are significant concerns about the diversity of news consumption. Users may narrow their attention to posts which reinforce their pre-existing views, which could lead to a more fragmented society. Aiming to combat this, earlier work divided news on a given story into high consensus and low consensus posts, based on how similar reactions can be expected from users with different political views: high consensus news elicits similar reactions, whereas low consensus news elicits different reactions from readers depending on their political leanings. In this work, we propose and quantify the benefits of a strategy to spread high consensus news across readers with diverse political leanings. We first compile a dataset and make the following three key observations: (1) low consensus news is more likely to remain within subgroups of users with similar political leanings, whereas high consensus news spreads more across subgroups; (2) high consensus news posted by neutral publishers spreads more equally across subgroups; and (3) users that get the information from other users instead of the publishers get an even more biased exposure to news. Then, we propose a strategy that spreads high consensus news through neutral publishers, and quantify the significant decrease in the disparity of users’ news exposure. Our extensive experiments on Twitter show that seeding high consensus information with neutral publishers is an effective way to achieve high spread with little disparity regarding political leaning.
@article{babaeitowards, title = {Towards Balanced Information Propagation in Social Media}, author = {Babaei, Mahmoudreza and Mirzasoleiman, Baharan and Joo, Jungseock and Weller, Adrian}, journal = {ACM conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO)}, year = {2022}, }
CompBio
Purification of single-cell transcriptomics data with coreset selection

Róbert Pálovics, Tony Wyss-Coray, and Baharan Mirzasoleiman

ICML Workshop on Computational Biology (CompBio), 2022

Abs Bib PDF

Despite the overall success of single-cell transcriptomics, variations in the number of cells captured from biological replicates in different regions of the embedding space of cells limit the interpretation of downstream computational analyses. Here we introduce a coreset selection based purification method to alleviate potential replicate specific biases within single-cell datasets. We first identify regions of the embedding space of cells that are not biased towards single biological replicates, and then extract a representative cell subset (coreset) covering them. We demonstrate that the extracted coresets provide a solid ground for downstream analyses. Specifically, we show that differential gene expression signatures based on purified datasets are robust against replicate specific biases across 24 different cell-type specific single-cell datasets. Furthermore, we highlight that purification can enhance supervised learning from single-cell transcriptomics data. Our results indicate substantial improvement in predictive performance (up to 0.16 gain in AUC) when testing logistic regression models on 8 cell type specific datasets across two independent cohorts.
@article{palovicspurification, title = {Purification of single-cell transcriptomics data with coreset selection}, author = {P{\'a}lovics, R{\'o}bert and Wyss-Coray, Tony and Mirzasoleiman, Baharan}, journal = {ICML Workshop on Computational Biology (CompBio)}, year = {2022}, }
TempWeb
Analytical Models for Motifs in Temporal Networks

Alexandra Porter, Baharan Mirzasoleiman, and Jure Leskovec

Temporal Web Analytics Workshop (TempWeb), 2022

Abs Bib PDF

Dynamic evolving networks capture temporal relations in domains such as social networks, communication networks, and financial transaction networks. In such networks, temporal motifs, which are repeated sequences of time-stamped edges/transactions, offer valuable information about the networks’ evolution and function. However, calculating temporal motif frequencies is computationally expensive as it requires, first, identifying all instances of the static motifs in the static graph induced by the temporal graph, and second, counting the number of subsequences of temporal edges that correspond to a temporal motif and occur within a time window. Since the number of temporal motifs changes over time, finding interesting temporal patterns involves iterative application of the above process over many consecutive time windows. This makes it impractical to scale to large real temporal networks. Here, we develop a fast and accurate model-based method for counting motifs in temporal networks. We first develop the Temporal Activity State Block Model (TASBM), to model temporal motifs in temporal graphs. Then we derive closed-form analytical expressions that allow us to quickly calculate expected motif frequencies and their variances in a given temporal network. Finally, we develop an efficient model fitting method, so that for a given network, we quickly fit the TASBM model and compute motif frequencies. We apply our approach to two real-world networks: a network of financial transactions and an email network. Experiments show that our TASBM framework (1) accurately counts temporal motifs in temporal networks; (2) easily scales to networks with tens of millions of edges/transactions; (3) is about 50x faster than explicit motif counting methods on networks of about 5 million temporal edges, a factor which increases with network size.
@article{porter2022analytical, title = {Analytical Models for Motifs in Temporal Networks}, author = {Porter, Alexandra and Mirzasoleiman, Baharan and Leskovec, Jure}, journal = {Temporal Web Analytics Workshop (TempWeb)}, pages = {903--909}, year = {2022}, }
SNN
Low Rank Pruning via Output Perturbation

Yuhan Liu, Siddharth Joshi, and Baharan Mirzasoleiman

Sparsity in Neural Networks Workshop (SNN), 2022

Abs Bib PDF

Neural networks have become very widespread due to the mainstream availability of computational devices such as GPUs, and as these devices become more powerful, these networks have become much larger. With the growing demand for fast, efficient networks, weight pruning has become a popular technique for improving the speed and reducing the computational time of these networks, but it introduces sparse matrices, which can be tedious to implement properly. In this paper, we investigate a different approach to model pruning involving low rank decompositions and output perturbation.
@article{liu2022low, title = {Low Rank Pruning via Output Perturbation}, author = {Liu, Yuhan and Joshi, Siddharth and Mirzasoleiman, Baharan}, journal = {Sparsity in Neural Networks Workshop (SNN)}, year = {2022}, }
AAAI
Crosswalk: Fairness-enhanced node representation learning

Ahmad Khajehnejad, Moein Khajehnejad, Mahmoudreza Babaei, Krishna P Gummadi, Adrian Weller, and Baharan Mirzasoleiman

AAAI Conference on Artificial Intelligence (AAAI), 2022

Abs Bib PDF Supp

The potential for machine learning systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. Much recent work has focused on developing algorithmic tools to assess and mitigate such unfairness. However, there is little work on enhancing fairness in graph algorithms. Here, we develop a simple, effective and general method, CrossWalk, that enhances fairness of various graph algorithms, including influence maximization, link prediction and node classification, applied to node embeddings. CrossWalk is applicable to any random walk based node representation learning algorithm, such as DeepWalk and Node2Vec. The key idea is to bias random walks to cross group boundaries, by upweighting edges which (1) are closer to the groups’ peripheries or (2) connect different groups in the network. CrossWalk pulls nodes that are near groups’ peripheries towards their neighbors from other groups in the embedding space, while preserving the necessary structural properties of the graph. Extensive experiments show the effectiveness of our algorithm to enhance fairness in various graph algorithms, including influence maximization, link prediction and node classification in synthetic and real networks, with only a very small decrease in performance.
@article{khajehnejad2022crosswalk, title = {Crosswalk: Fairness-enhanced node representation learning}, author = {Khajehnejad, Ahmad and Khajehnejad, Moein and Babaei, Mahmoudreza and Gummadi, Krishna P and Weller, Adrian and Mirzasoleiman, Baharan}, journal = {AAAI Conference on Artificial Intelligence (AAAI)}, volume = {36}, number = {11}, pages = {11963--11970}, year = {2022}, }
ICDE
On the fairness of time-critical influence maximization in social networks

Junaid Ali, Mahmoudreza Christiansen Babaei, Abhijnan Chakraborty, Baharan Mirzasoleiman, Krishna Gummadi, and Adish Singla

IEEE International Conference on Data Engineering (ICDE), 2022

Abs Bib PDF

Influence maximization has found applications in a wide range of real-world problems, for instance, viral marketing of products in an online social network, and propagation of valuable information such as job vacancy advertisements. While existing algorithmic techniques usually aim at maximizing the total number of people influenced, the population often comprises several socially salient groups, e.g., based on gender or race. As a result, these techniques could lead to disparity across different groups in receiving important information. Furthermore, in many applications, the spread of influence is time-critical, i.e., it is only beneficial to be influenced before a deadline. As we show in this paper, such time-criticality of information could further exacerbate the disparity of influence across groups. This disparity could have far-reaching consequences, impacting people’s prosperity and putting minority groups at a big disadvantage. In this work, we propose a notion of group fairness in time-critical influence maximization. We introduce surrogate objective functions to solve the influence maximization problem under fairness considerations. By exploiting the submodularity structure of our objectives, we provide computationally efficient algorithms with guarantees that are effective in enforcing fairness during the propagation process. Extensive experiments on synthetic and real-world datasets demonstrate the efficacy of our proposal.
@article{ali2022fairness, title = {On the fairness of time-critical influence maximization in social networks}, author = {Ali, Junaid and Babaei, Mahmoudreza Christiansen and Chakraborty, Abhijnan and Mirzasoleiman, Baharan and Gummadi, Krishna and Singla, Adish}, journal = {IEEE International Conference on Data Engineering (ICDE)}, year = {2022}, publisher = {IEEE}, }

2020

UAI
Coresets for estimating means and mean square error with limited greedy samples

Saeed Vahidian, Baharan Mirzasoleiman, and Alexander Cloninger

Conference on Uncertainty in Artificial Intelligence (UAI), 2020

Abs Bib PDF Supp

In a number of situations, collecting a function value for every data point may be prohibitively expensive, and random sampling ignores any structure in the underlying data. We introduce a scalable optimization algorithm with no correction steps (in contrast to Frank–Wolfe and its variants), a variant of gradient ascent for coreset selection in graphs, that greedily selects a weighted subset of vertices that are deemed most important to sample. Our algorithm estimates the mean of the function by taking a weighted sum only at these vertices, and we provably bound the estimation error in terms of the location and weights of the selected vertices in the graph. In addition, we consider the case where nodes have different selection costs and provide bounds on the quality of the low-cost selected coresets. We demonstrate the benefits of our algorithm on the semi-supervised node classification of graph convolutional neural network, point clouds and structured graphs, as well as sensor placement where the cost of placing sensors depends on the location of the placement. We also elucidate that the empirical convergence of our proposed method is faster than random selection and various clustering methods while still respecting sensor placement cost. The paper concludes with validation of the developed algorithm on both synthetic and real datasets, demonstrating that it outperforms the current state of the art.
@article{vahidian2020coresets, title = {Coresets for estimating means and mean square error with limited greedy samples}, author = {Vahidian, Saeed and Mirzasoleiman, Baharan and Cloninger, Alexander}, journal = {Conference on Uncertainty in Artificial Intelligence (UAI)}, pages = {350--359}, year = {2020}, organization = {PMLR}, }
ICML
Coresets for data-efficient training of machine learning models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec

International Conference on Machine Learning (ICML), 2020

Abs Bib PDF Supp Code

Incremental gradient (IG) methods, such as stochastic gradient descent and its variants, are commonly used for large scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open question how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near) optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine learning models. Our extensive experiments show that CRAIG, while achieving practically the same solution, speeds up various IG methods by up to 6x for logistic regression and 3x for training deep neural networks.
@article{mirzasoleiman2020coresets, title = {Coresets for data-efficient training of machine learning models}, author = {Mirzasoleiman, Baharan and Bilmes, Jeff and Leskovec, Jure}, journal = {International Conference on Machine Learning (ICML)}, pages = {6950--6960}, year = {2020}, organization = {PMLR}, efficient = {true} }
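The selection step in the CRAIG abstract above (a weighted subset chosen by maximizing a submodular function) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a facility-location objective over pairwise distances between per-example gradient proxies, and it assumes each selected example is weighted by the number of examples it represents. The function name `craig_like_coreset` and the toy vectors are hypothetical.

```python
import math

def craig_like_coreset(grads, k):
    """Greedy facility-location selection over gradient proxies (illustrative).

    grads: list of equal-length tuples, one hypothetical gradient proxy per example.
    Returns (selected indices, weights), where each weight counts the examples
    whose nearest selected proxy is that index.
    """
    n = len(grads)
    dist = [[math.dist(a, b) for b in grads] for a in grads]
    d_max = max(max(row) for row in dist)
    # Turn distances into non-negative similarities.
    sim = [[d_max - d for d in row] for row in dist]

    selected = []
    best_sim = [0.0] * n  # similarity of each example to its closest selected one
    for _ in range(min(k, n)):
        def gain(j):
            # Marginal increase of sum_i max_{s in selected} sim[i][s] if j is added.
            return sum(max(sim[i][j] - best_sim[i], 0.0) for i in range(n))
        j_star = max((j for j in range(n) if j not in selected), key=gain)
        selected.append(j_star)
        best_sim = [max(best_sim[i], sim[i][j_star]) for i in range(n)]

    # Assumed weighting rule: size of the cluster each selected example represents.
    weights = [0] * len(selected)
    for i in range(n):
        nearest = min(range(len(selected)), key=lambda s: dist[i][selected[s]])
        weights[nearest] += 1
    return selected, weights

# Toy gradient proxies forming three loose groups (made-up numbers).
toy_grads = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
selected, weights = craig_like_coreset(toy_grads, k=3)
```

In a training loop, the weights would scale each selected example's gradient so that the weighted sum approximates the full-data gradient; how the proxies are computed for a real model is left open here.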
NeurIPS
Coresets for robust training of deep neural networks against noisy labels

Baharan Mirzasoleiman, Kaidi Cao, and Jure Leskovec

Advances in Neural Information Processing Systems (NeurIPS), 2020

Abs Bib PDF Supp Code

Modern neural networks have the capacity to overfit noisy labels frequently found in real-world datasets. Although great progress has been made, existing techniques are very limited in providing theoretical guarantees for the performance of the neural networks trained with noisy labels. To tackle this challenge, we propose a novel approach with strong theoretical guarantees for robust training of neural networks trained with noisy labels. The key idea behind our method is to select subsets of clean data points that provide an approximately low-rank Jacobian matrix. We then prove that gradient descent applied to the subsets cannot overfit the noisy labels, without regularization or early stopping. Our extensive experiments corroborate our theory and demonstrate that deep networks trained on our subsets achieve a significantly superior performance, e.g., 7% increase in accuracy on mini Webvision with 50% noisy labels, compared to the state of the art.
@article{mirzasoleiman2020coresett, title = {Coresets for robust training of deep neural networks against noisy labels}, author = {Mirzasoleiman, Baharan and Cao, Kaidi and Leskovec, Jure}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, volume = {33}, pages = {11465--11477}, year = {2020}, noise = {true} }
ICLR
Selection via Proxy: Efficient Data Selection for Deep Learning

Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia

International Conference on Learning Representations (ICLR), 2020

Abs Bib PDF Supp

Data selection methods, such as active learning and core-set selection, are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because they depend on feature representations that need to be learned. In this work, we show that we can greatly improve the computational efficiency by using a small proxy model to perform data selection (e.g., selecting data points to label for active learning). By removing hidden layers from the target model, using smaller architectures, and training for fewer epochs, we create proxies that are an order of magnitude faster to train. Although these small proxy models have higher error rates, we find that they empirically provide useful signals for data selection. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks across five datasets: CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity, and Amazon Review Full. For active learning, applying SVP can give an order of magnitude improvement in data selection runtime (i.e., the time it takes to repeatedly train and select points) without significantly increasing the final error (often within 0.1%). For core-set selection on CIFAR10, proxies that are over 10x faster to train than their larger, more accurate targets can remove up to 50% of the data without harming the final accuracy of the target, leading to a 1.6x end-to-end training time improvement.
@article{colemanselection, title = {Selection via Proxy: Efficient Data Selection for Deep Learning}, author = {Coleman, Cody and Yeh, Christopher and Mussmann, Stephen and Mirzasoleiman, Baharan and Bailis, Peter and Liang, Percy and Leskovec, Jure and Zaharia, Matei}, journal = {International Conference on Learning Representations (ICLR)}, year = {2020}, }

2018

AAAI
Streaming non-monotone submodular maximization: Personalized video summarization on the fly

Baharan Mirzasoleiman, Stefanie Jegelka, and Andreas Krause

AAAI Conference on Artificial Intelligence (AAAI), 2018

Abs Bib PDF Supp

The need for real-time analysis of rapidly produced data streams (e.g., video and image streams) motivated the design of streaming algorithms that can efficiently extract and summarize useful information from massive data "on the fly." Such problems can often be reduced to maximizing a submodular set function subject to various constraints. While efficient streaming methods have been recently developed for monotone submodular maximization, in a wide range of applications, such as video summarization, the underlying utility function is non-monotone, and there are often various constraints imposed on the optimization problem to consider privacy or personalization. We develop the first efficient single pass streaming algorithm, Streaming Local Search, that for any streaming monotone submodular maximization algorithm with approximation guarantee α under a collection of independence systems I, provides a constant 1/(1 + 2/√α + 1/α + 2d(1 + √α)) approximation guarantee for maximizing a non-monotone submodular function under the intersection of I and d knapsack constraints. Our experiments show that for video summarization, our method runs more than 1700 times faster than previous work, while maintaining practically the same performance.
@article{mirzasoleiman2018streaming, title = {Streaming non-monotone submodular maximization: Personalized video summarization on the fly}, author = {Mirzasoleiman, Baharan and Jegelka, Stefanie and Krause, Andreas}, journal = {AAAI Conference on Artificial Intelligence (AAAI)}, volume = {32}, number = {1}, year = {2018}, }
NeurIPS
Dynamic network model from partial observations

Elahe Ghalebi, Baharan Mirzasoleiman, Radu Grosu, and Jure Leskovec

Advances in Neural Information Processing Systems (NeurIPS), 2018

Spotlight presentation (top 3%)

Abs Bib PDF

Can evolving networks be inferred and modeled without directly observing their nodes and edges? In many applications, the edges of a dynamic network might not be observed, but one can observe the dynamics of stochastic cascading processes (e.g., information diffusion, virus propagation) occurring over the unobserved network. While there have been efforts to infer networks based on such data, providing a generative probabilistic model that is able to identify the underlying time-varying network remains an open question. Here we consider the problem of inferring generative dynamic network models based on network cascade diffusion data. We propose a novel framework for providing a non-parametric dynamic network model, based on a mixture of coupled hierarchical Dirichlet processes, using data capturing cascade node infection times. Our approach allows us to infer the evolving community structure in networks and to obtain an explicit predictive distribution over the edges of the underlying network, including those that were not involved in transmission of any cascade, or are likely to appear in the future. We show the effectiveness of our approach using extensive experiments on synthetic as well as real-world networks.
@article{ghalebi2018dynamic, title = {Dynamic network model from partial observations}, author = {Ghalebi, Elahe and Mirzasoleiman, Baharan and Grosu, Radu and Leskovec, Jure}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, volume = {31}, year = {2018}, award = {Spotlight presentation (top 3%)} }

2017

ICML
Deletion-robust submodular maximization: Data summarization with “the right to be forgotten”

Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause

International Conference on Machine Learning (ICML), 2017

Abs Bib PDF Supp

How can we summarize a dynamic data stream when elements selected for the summary can be deleted at any time? This is an important challenge in online services, where the users generating the data may decide to exercise their right to restrict the service provider from using (part of) their data due to privacy concerns. Motivated by this challenge, we introduce the dynamic deletion-robust submodular maximization problem. We develop the first resilient streaming algorithm, called ROBUST-STREAMING, with a constant factor approximation guarantee to the optimum solution. We evaluate the effectiveness of our approach on several real-world applications, including summarizing (1) streams of geo-coordinates; (2) streams of images; and (3) click-stream log data, consisting of 45 million feature vectors from a news recommendation task.
@article{mirzasoleiman2017deletion, title = {Deletion-robust submodular maximization: Data summarization with “the right to be forgotten”}, author = {Mirzasoleiman, Baharan and Karbasi, Amin and Krause, Andreas}, journal = {International Conference on Machine Learning (ICML)}, pages = {2449--2458}, year = {2017}, organization = {PMLR}, }
AISTATS
Guaranteed non-convex optimization: Submodular maximization over continuous domains

Andrew An Bian, Baharan Mirzasoleiman, Joachim Buhmann, and Andreas Krause

Artificial Intelligence and Statistics (AISTATS), 2017

Abs Bib PDF Supp

Submodular continuous functions are a category of (generally) non-convex/non-concave functions with a wide spectrum of applications. We characterize these functions and demonstrate that they can be maximized efficiently with approximation guarantees. Specifically, i) we introduce the weak DR property that gives a unified characterization of submodularity for all set, integer-lattice and continuous functions; ii) for maximizing monotone DR-submodular continuous functions under general down-closed convex constraints, we propose a Frank-Wolfe variant with a (1-1/e) approximation guarantee and sub-linear convergence rate; iii) for maximizing general non-monotone submodular continuous functions subject to box constraints, we propose a DoubleGreedy algorithm with a 1/3 approximation guarantee. Submodular continuous functions naturally find applications in various real-world settings, including influence and revenue maximization with continuous assignments, sensor energy management, facility location, etc. Experimental results show that the proposed algorithms efficiently generate superior solutions compared to baseline algorithms.
@article{bian2017guaranteed, title = {Guaranteed non-convex optimization: Submodular maximization over continuous domains}, author = {Bian, Andrew An and Mirzasoleiman, Baharan and Buhmann, Joachim and Krause, Andreas}, journal = {Artificial Intelligence and Statistics (AISTATS)}, pages = {111--120}, year = {2017}, organization = {PMLR}, }

2016

ICML
Learning sparse combinatorial representations via two-stage submodular maximization

Eric Balkanski*, Baharan Mirzasoleiman*, Andreas Krause, and Yaron Singer

International Conference on Machine Learning (ICML), 2016

Abs Bib PDF Supp

We consider the problem of learning sparse representations of data sets, where the goal is to reduce a data set in a manner that optimizes multiple objectives. Motivated by applications of data summarization, we develop a new model which we refer to as the two-stage submodular maximization problem. This task can be viewed as a combinatorial analogue of representation learning problems such as dictionary learning and sparse regression. The two-stage problem strictly generalizes the problem of cardinality constrained submodular maximization, though the objective function is not submodular and the techniques for submodular maximization cannot be applied. We describe a continuous optimization method which achieves an approximation ratio which asymptotically approaches 1-1/e. For instances where the asymptotics do not kick in, we design a local-search algorithm whose approximation ratio is arbitrarily close to 1/2. We empirically demonstrate the effectiveness of our methods on two multi-objective data summarization tasks, where the goal is to construct summaries via sparse representative subsets with respect to predefined objectives.
@article{balkanski2016learning, title = {Learning sparse combinatorial representations via two-stage submodular maximization}, author = {Balkanski*, Eric and Mirzasoleiman*, Baharan and Krause, Andreas and Singer, Yaron}, journal = {International Conference on Machine Learning (ICML)}, pages = {2207--2216}, year = {2016}, organization = {PMLR}, }
ICML
Fast constrained submodular maximization: Personalized data summarization

Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, and Amin Karbasi

International Conference on Machine Learning (ICML), 2016

Abs Bib PDF Supp

Can we summarize multi-category data based on user preferences in a scalable manner? Many utility functions used for data summarization satisfy submodularity, a natural diminishing returns property. We cast personalized data summarization as an instance of a general submodular maximization problem subject to multiple constraints. We develop the first practical and FAst coNsTrained submOdular Maximization algorithm, FANTOM, with strong theoretical guarantees. FANTOM maximizes a submodular function (not necessarily monotone) subject to the intersection of a p-system and l knapsack constraints. It achieves a (1 + ε)(p + 1)(2p + 2l + 1)/p approximation guarantee with only O(nrp log(n)/ε) query complexity (n and r indicate the size of the ground set and the size of the largest feasible solution, respectively). We then show how we can use FANTOM for personalized data summarization. In particular, a p-system can model different aspects of data, such as categories or time stamps, from which the users choose. In addition, knapsacks encode users’ constraints including budget or time. In our set of experiments, we consider several concrete applications: movie recommendation over 11K movies, personalized image summarization with 10K images, and revenue maximization on the YouTube social networks with 5000 communities. We observe that FANTOM consistently provides the highest utility against all the baselines.
@article{mirzasoleiman2016fast, title = {Fast constrained submodular maximization: Personalized data summarization}, author = {Mirzasoleiman, Baharan and Badanidiyuru, Ashwinkumar and Karbasi, Amin}, journal = {International Conference on Machine Learning (ICML)}, pages = {1358--1367}, year = {2016}, organization = {PMLR}, }
NeurIPS
Fast distributed submodular cover: Public-private data summarization

Baharan Mirzasoleiman, Morteza Zadimoghaddam, and Amin Karbasi

Advances in Neural Information Processing Systems (NeurIPS), 2016

Abs Bib PDF Supp

In this paper, we introduce the public-private framework of data summarization motivated by privacy concerns in personalized recommender systems and online social services. Such systems usually have access to massive data generated by a large pool of users. A major fraction of the data is public and is visible to (and can be used for) all users. However, each user can also contribute some private data that should not be shared with other users to ensure her privacy. The goal is to provide a succinct summary of a massive dataset, ideally as small as possible, from which customized summaries can be built for each user, i.e., it can contain elements from the public data (for diversity) and users’ private data (for personalization). To formalize the above challenge, we assume that the scoring function according to which a user evaluates the utility of her summary satisfies submodularity, a widely used notion in data summarization applications. Thus, we model the data summarization targeted to each user as an instance of a submodular cover problem. However, when the data is massive it is infeasible to use the centralized greedy algorithm to find a customized summary even for a single user. Moreover, for a large pool of users, it is too time consuming to find such summaries separately. Instead, we develop a fast distributed algorithm for submodular cover, FASTCOVER, that provides a succinct summary in one shot and for all users. We show that the solution provided by FASTCOVER is competitive with that of the centralized algorithm, with a number of rounds that is exponentially smaller than state-of-the-art results. Moreover, we have implemented FASTCOVER with Spark to demonstrate its practical performance on a number of concrete applications, including personalized location recommendation, personalized movie recommendation, and dominating set on tens of millions of data points and varying number of users.
@article{mirzasoleiman2016fasu, title = {Fast distributed submodular cover: Public-private data summarization}, author = {Mirzasoleiman, Baharan and Zadimoghaddam, Morteza and Karbasi, Amin}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, volume = {29}, year = {2016}, }
JMLR
Distributed submodular maximization

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause

The Journal of Machine Learning Research (JMLR), 2016

Abs Bib PDF

Many large-scale machine learning problems–clustering, non-parametric learning, kernel machines, etc.–require selecting a small yet representative subset from a large dataset. Such problems can often be reduced to maximizing a submodular set function subject to various constraints. Classical approaches to submodular optimization require centralized access to the full dataset, which is impractical for truly large-scale problems. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol GreeDi, that is easily implemented using MapReduce style computations. We theoretically analyze our approach, and show that under certain natural conditions, performance close to the centralized approach can be achieved. We begin with monotone submodular maximization subject to a cardinality constraint, and then extend this approach to obtain approximation guarantees for (not necessarily monotone) submodular maximization subject to more general constraints including matroid or knapsack constraints. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar based clustering on tens of millions of examples using Hadoop.
@article{mirzasoleiman2016distributed, title = {Distributed submodular maximization}, author = {Mirzasoleiman, Baharan and Karbasi, Amin and Sarkar, Rik and Krause, Andreas}, journal = {The Journal of Machine Learning Research (JMLR)}, volume = {17}, number = {1}, pages = {8330--8373}, year = {2016}, publisher = {JMLR.org}, }
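The two-stage protocol mentioned in the GreeDi abstract above can be sketched for the cardinality-constrained case. This is a single-process illustration under stated assumptions, not the authors' MapReduce code: the round-robin partitioning, the function names `greedy` and `greedi`, and the toy coverage objective are hypothetical, and the "machines" are simulated by a plain loop.

```python
def greedy(candidates, f, k):
    """Standard greedy: repeatedly add the candidate with the largest marginal gain."""
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        base = f(selected)
        best = max(pool, key=lambda e: f(selected + [e]) - base)
        selected.append(best)
        pool.remove(best)
    return selected

def greedi(ground_set, f, k, num_machines):
    """Two-stage distributed greedy, simulated sequentially."""
    # Stage 1: partition the data (round-robin here, an assumption) and
    # run greedy independently on each part.
    parts = [ground_set[i::num_machines] for i in range(num_machines)]
    local_solutions = [greedy(part, f, k) for part in parts]
    # Stage 2: merge the local solutions and run greedy once more on the union.
    merged = [e for solution in local_solutions for e in solution]
    central_solution = greedy(merged, f, k)
    # Return the best of all candidate solutions.
    return max(local_solutions + [central_solution], key=f)

# Toy monotone submodular objective: number of items covered by the chosen sets.
toy_sets = [{1, 2, 3}, {3, 4}, {5}, {1, 5, 6, 7}, {8}, {2, 8, 9}]

def coverage(selected):
    covered = set()
    for i in selected:
        covered |= toy_sets[i]
    return len(covered)

summary = greedi(list(range(len(toy_sets))), coverage, k=3, num_machines=2)
```

Only the local solutions, at most k elements per machine, are communicated to the second stage, which is what makes the scheme attractive when the full dataset cannot be held centrally.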

2015

AAAI
Lazier than lazy greedy

Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause

AAAI Conference on Artificial Intelligence (AAAI), 2015

Abs Bib PDF

Is it possible to maximize a monotone submodular function faster than the widely used lazy greedy algorithm (also known as accelerated greedy), both in theory and practice? In this paper, we develop the first linear-time algorithm for maximizing a general monotone submodular function subject to a cardinality constraint. We show that our randomized algorithm, STOCHASTIC-GREEDY, can achieve a (1 − 1/e − ε) approximation guarantee, in expectation, to the optimum solution in time linear in the size of the data and independent of the cardinality constraint. We empirically demonstrate the effectiveness of our algorithm on submodular functions arising in data summarization, including training large-scale kernel methods, exemplar-based clustering, and sensor placement. We observe that STOCHASTIC-GREEDY practically achieves the same utility value as lazy greedy but runs much faster. More surprisingly, we observe that in many practical scenarios STOCHASTIC-GREEDY does not evaluate the whole fraction of data points even once and still achieves indistinguishable results compared to lazy greedy.
@article{mirzasoleiman2015lazier, title = {Lazier than lazy greedy}, author = {Mirzasoleiman, Baharan and Badanidiyuru, Ashwinkumar and Karbasi, Amin and Vondr{\'a}k, Jan and Krause, Andreas}, journal = {AAAI Conference on Artificial Intelligence (AAAI)}, volume = {29}, number = {1}, year = {2015}, }
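The randomized selection rule behind STOCHASTIC-GREEDY can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the commonly cited per-step sample size of (n/k)·ln(1/ε) random candidates, and the toy coverage objective and function names are hypothetical.

```python
import math
import random

def coverage(selected, sets):
    """Toy monotone submodular objective: number of items covered."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)

def stochastic_greedy(sets, k, eps=0.1, seed=0):
    """Pick k sets; each step scans only a random sample of the remaining ones."""
    rng = random.Random(seed)
    n = len(sets)
    # Assumed sample size per step: (n / k) * ln(1 / eps), clipped to [1, n].
    sample_size = min(n, max(1, math.ceil((n / k) * math.log(1.0 / eps))))
    selected, remaining = [], list(range(n))
    for _ in range(min(k, n)):
        candidates = rng.sample(remaining, min(sample_size, len(remaining)))
        base = coverage(selected, sets)
        # Choose the sampled candidate with the largest marginal gain.
        best = max(candidates, key=lambda i: coverage(selected + [i], sets) - base)
        selected.append(best)
        remaining.remove(best)
    return selected

toy_sets = [{1, 2, 3}, {3, 4}, {5}, {1, 5, 6, 7}, {8}, {2, 8, 9}]
chosen = stochastic_greedy(toy_sets, k=3, eps=0.1)
```

With this sample size the total number of marginal-gain evaluations is about n·ln(1/ε), which does not grow with k. Setting ε very small makes every step scan all remaining elements, so the sketch degenerates to the plain greedy algorithm.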
NeurIPS
Distributed submodular cover: Succinctly summarizing massive data

Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, and Andreas Krause

Advances in Neural Information Processing Systems (NeurIPS), 2015

Spotlight presentation (top 4%)

Abs Bib PDF Supp

How can one find a subset, ideally as small as possible, that well represents a massive dataset? I.e., its corresponding utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. In this paper, we formalize this challenge as a submodular cover problem. Here, the utility is assumed to exhibit submodularity, a natural diminishing returns condition prevalent in many data summarization applications. The classical greedy algorithm is known to provide solutions with logarithmic approximation guarantees compared to the optimum solution. However, this sequential, centralized approach is impractical for truly large-scale problems. In this work, we develop the first distributed algorithm, DISCOVER, for submodular set cover that is easily implementable using MapReduce-style computations. We theoretically analyze our approach, and present approximation guarantees for the solutions returned by DISCOVER. We also study a natural trade-off between the communication cost and the number of rounds required to obtain such a solution. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including active set selection, exemplar based clustering, and vertex cover on tens of millions of data points using Spark.
@article{mirzasoleiman2015distributed, title = {Distributed submodular cover: Succinctly summarizing massive data}, author = {Mirzasoleiman, Baharan and Karbasi, Amin and Badanidiyuru, Ashwinkumar and Krause, Andreas}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, volume = {28}, year = {2015}, award = {Spotlight presentation (top 4%)} }

2014

KDD
Streaming submodular maximization: Massive data summarization on the fly

Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause

ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), 2014

Abs Bib PDF Supp

How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. I.e., we would like to select a subset of, say, k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with a constant-factor (1/2 − ε) approximation guarantee to the optimum solution, requiring only a single pass through the data, and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
@article{badanidiyuru2014streaming, title = {Streaming submodular maximization: Massive data summarization on the fly}, author = {Badanidiyuru, Ashwinkumar and Mirzasoleiman, Baharan and Karbasi, Amin and Krause, Andreas}, journal = {ACM SIGKDD international conference on Knowledge discovery and data mining (KDD)}, pages = {671--680}, year = {2014}, }
NetSciCom
Modeling the impact of user awareness on immunization strategies

Baharan Mirzasoleiman, Hamid R Rabiee, and Mostafa Salehi

IEEE International Workshop on Network Science for Communication Networks (NetSciCom), 2014

Abs Bib PDF

Despite the efforts to design better antivirus software, malware continues to spread and cause enormous damage. The effect of immunizing computer systems, as the most effective control policy for preventing such infections, is two-fold. On one hand, it increases the global immunity of the network by providing indirect protection for unimmunized systems. On the other hand, raising the awareness of users of the possibility of infection can trigger behavioral changes by which users take measures to reduce their systems' susceptibility using the antivirus software. Here, we propose the Behavior-Immunity model that allows measurement of vaccination effect based on the indirect protective effect of immunization strategies. It also provides a means to utilize human behavioral changes to enhance the effectiveness of immunization strategies. In this work, we focus on word of mouth as the source of user awareness and show that immunization schemes can appropriately utilize the behavioral changes to achieve better results. We also present a methodology for network immunization which is provably close to the optimal solution. Extensive computational experiments on some synthetic and real-world networks revealed that this strategy offers a significant improvement over the well-studied targeted immunization method based on degree centrality.
@article{mirzasoleiman2014modeling, title = {Modeling the impact of user awareness on immunization strategies}, author = {Mirzasoleiman, Baharan and Rabiee, Hamid R and Salehi, Mostafa}, journal = {IEEE International Workshop on Network Science for Communication Networks (NetSciCom)}, year = {2014}, }

2013

SNAM
Revenue maximization in social networks through discounting

Mahmoudreza Babaei, Baharan Mirzasoleiman, Mahdi Jalili, and Mohammad Ali Safari

Social Network Analysis and Mining (SNAM), 2013

Abs Bib PDF

Social networking has become a part of daily life for many individuals across the world. Widespread adoption of various strategies in such networks can be utilized by business corporations as a powerful means for advertising. In this study, we investigated viral marketing strategies in which buyers are influenced by other buyers who already own an item. Since finding an optimal marketing strategy is NP-hard, a simple strategy has been proposed in which giving the item for free to a subset of influential buyers in a network increases the valuation of the other potential buyers for the item. In this study, we considered the more general problem by offering discounts instead of giving the item for free to an initial set of buyers. We introduced three approaches for finding an appropriate discount sequence based on the following iterative idea: In each step, we offer the item to the potential buyers with a discounted price in a way that they all accept the offers and buy the product. Selling the item to the most influential buyers as the opinion leaders increases the willingness of other buyers to pay a higher price. Thus, in the following steps, we can offer the item with a lower discount while still guaranteeing the acceptance of the offers. Furthermore, we investigated two marketing strategies based on local search and hill climbing algorithms. Extensive computational experiments on artificially constructed model networks as well as on a number of real-world networks revealed the effectiveness of the proposed discount-based strategies.
@article{babaei2013revenue, title = {Revenue maximization in social networks through discounting}, author = {Babaei, Mahmoudreza and Mirzasoleiman, Baharan and Jalili, Mahdi and Safari, Mohammad Ali}, journal = {Social Network Analysis and Mining (SNAM)}, volume = {3}, pages = {1249--1262}, year = {2013}, publisher = {Springer}, }
NeurIPS
Distributed submodular maximization: Identifying representative elements in massive data

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause

Advances in Neural Information Processing Systems (NeurIPS), 2013

Abs Bib PDF Supp

Many large-scale machine learning problems (such as clustering, non-parametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable, representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set; but for truly large-scale problems, rendering the data centrally is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol, GreeDI, that is easily implemented using MapReduce-style computations. We theoretically analyze our approach and show that, under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference on tens of millions of examples using Hadoop.
@article{mirzasoleiman2013distributed, title = {Distributed submodular maximization: Identifying representative elements in massive data}, author = {Mirzasoleiman, Baharan and Karbasi, Amin and Sarkar, Rik and Krause, Andreas}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, volume = {26}, year = {2013}, }
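The two-stage idea behind GreeDI can be sketched roughly as follows (a simplified, single-process illustration, not the authors' MapReduce implementation): each partition runs greedy selection locally, the local solutions are merged, and greedy runs once more on the merged pool. The toy `coverage` utility, the round-robin partitioning, the function names, and the final comparison against the best local solution are all assumptions of this sketch.

```python
def coverage(selected, sets):
    """Toy monotone submodular utility (assumed here): number of items covered."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)


def greedy(candidates, k, sets):
    """Standard greedy: repeatedly add the candidate with the largest marginal gain."""
    selected = []
    pool = list(candidates)
    for _ in range(min(k, len(pool))):
        base = coverage(selected, sets)
        best = max(pool, key=lambda i: coverage(selected + [i], sets) - base)
        selected.append(best)
        pool.remove(best)
    return selected


def two_stage_greedy(sets, k, num_machines):
    """Stage 1: greedy on each partition. Stage 2: greedy on the merged local picks."""
    n = len(sets)
    # Round-robin partitioning stands in for distributing data across machines.
    partitions = [list(range(p, n, num_machines)) for p in range(num_machines)]
    local_solutions = [greedy(part, k, sets) for part in partitions]
    merged = [i for solution in local_solutions for i in solution]
    final = greedy(merged, k, sets)
    # Keep whichever candidate solution has the highest utility.
    return max(local_solutions + [final], key=lambda s: coverage(s, sets))


if __name__ == "__main__":
    toy_sets = [{1, 2}, {3, 4}, {1, 2, 3}, {5}]
    print(two_stage_greedy(toy_sets, k=2, num_machines=2))
```

In a real deployment the first stage would run in parallel on separate workers, and only the small local solutions would be communicated for the second stage.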

2012

Europhys.Lett.
Immunizing complex networks with limited budget

Baharan Mirzasoleiman, Mahmoudreza Babaei, and Mahdi Jalili

Europhysics Letters, 2012

Abs Bib PDF

In this letter we studied the epidemic spreading on scale-free networks assuming a limited budget for immunization. We proposed a general model in which the immunity of an individual against the disease depends on its immunized friends in the network. Furthermore, we considered the possibility that each individual might be eager to pay a price to buy the vaccine and become immune against the disease. Under these assumptions we proposed an algorithm for improving the performance of all previous immunization algorithms. We also introduced a heuristic extension of the algorithm, which works well in scale-free networks.
@article{mirzasoleiman2012immunizing, title = {Immunizing complex networks with limited budget}, author = {Mirzasoleiman, Baharan and Babaei, Mahmoudreza and Jalili, Mahdi}, journal = {Europhysics Letters}, volume = {98}, number = {3}, pages = {38004}, year = {2012}, publisher = {IOP Publishing}, }

2011

Phys.Rev.E
Cascaded failures in weighted networks

Baharan Mirzasoleiman, Mahmoudreza Babaei, Mahdi Jalili, and MohammadAli Safari

Physical Review E, 2011

Abs Bib PDF

Many technological networks can experience random and/or systematic failures in their components. More destructive situations can happen if the components have limited capacity, where the failure in one of them might lead to a cascade of failures in other components, and consequently break down the structure of the network. In this paper, the tolerance of cascaded failures was investigated in weighted networks. Three weighting strategies were considered including the betweenness centrality of the edges, the product of the degrees of the end nodes, and the product of their betweenness centralities. Then, the effect of the cascaded attack was investigated by considering the local weighted flow redistribution rule. The capacity of the edges was considered to be proportional to their initial weight distribution. The size of the survived part of the attacked network was determined in model networks as well as in a number of real-world networks including the power grid, the Internet at the level of autonomous systems, the railway network of Europe, and the United States airports network. We found that the networks in which the weight of each edge is the product of the betweenness centralities of the end nodes had the best robustness against cascaded failures. In other words, the case where the load of the links is considered to be the product of the betweenness centralities of the end nodes is favored for the robustness of the network against cascaded failures.
@article{mirzasoleiman2011cascaded, title = {Cascaded failures in weighted networks}, author = {Mirzasoleiman, Baharan and Babaei, Mahmoudreza and Jalili, Mahdi and Safari, MohammadAli}, journal = {Physical Review E}, volume = {84}, number = {4}, pages = {046114}, year = {2011}, publisher = {APS}, }
PLoS
Failure tolerance of motif structure in biological networks

Baharan Mirzasoleiman, and Mahdi Jalili

PLoS One, 2011

Abs Bib PDF

Complex networks serve as generic models for many biological systems that have been shown to share a number of common structural properties such as power-law degree distribution and small-worldness. Real-world networks are composed of building blocks called motifs that are indeed specific subgraphs of (usually) a small number of nodes. Network motifs are important in the functionality of complex networks, and the role of some motifs such as the feed-forward loop in many biological networks has been heavily studied. On the other hand, many biological networks have shown some degree of robustness in terms of their efficiency and connectedness against failures in their components. In this paper we investigated how random and systematic failures in the edges of biological networks influenced their motif structure. We considered two biological networks, namely, protein structure network and human brain functional network. Furthermore, we considered random failures as well as systematic failures based on different strategies for choosing candidate edges for removal. Failure in the edges tipping to high degree nodes had the most destructive role in the motif structure of the networks by decreasing their significance level, while removing edges that were connected to nodes with high values of betweenness centrality had the least effect on the significance profiles. In some cases, the latter caused an increase in the significance levels of the motifs.
@article{mirzasoleiman2011failure, title = {Failure tolerance of motif structure in biological networks}, author = {Mirzasoleiman, Baharan and Jalili, Mahdi}, journal = {PLoS One}, volume = {6}, number = {5}, pages = {e20512}, year = {2011}, publisher = {Public Library of Science San Francisco, USA}, }
ICC
Reuse-Attack Mitigation in Wireless Sensor Networks

Hossein Shafiei, Ahmad Khonsari, Baharan Mirzasoleiman, and Mohammad Ould-Khaoua

IEEE International Conference on Communications (ICC), 2011

Best paper award runner up

Abs Bib PDF

Privacy preservation in wireless sensor networks has drawn considerable attention from the research community during the last few years. The emergence of single-owner, multi-user commercial sensor networks, along with the hostile and uncontrollable environment of such networks, makes security in such networks an issue of great importance. This paper concentrates on token-based privacy preservation schemes. A possible attack on such schemes is introduced and two different approaches are utilized to mitigate the attack. Mathematical models for considering the attack effect and overhead are presented and the results are verified using extensive simulations.
@article{mirzasoleiman2011failurf, title = {Reuse-Attack Mitigation in Wireless Sensor Networks}, author = {Shafiei, Hossein and Khonsari, Ahmad and Mirzasoleiman, Baharan and Ould-Khaoua, Mohammad}, journal = {IEEE International Conference on Communications (ICC)}, year = {2011}, award = {Best paper award runner up} }

2009

ISPA
Utility proportional optimization flow control for overlay multicast

Ali Jafari, Hosein Shafiei, Baharan Mirzasoleiman, and Ghodrat Sepidnam

IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2009

Abs Bib PDF

A deniable authentication allows the receiver to identify the source of the received messages but cannot prove it to any third party. However, the deniability of the content, which is called restricted deniability in this paper, is concerned in electronic voting and some other similar application. At present, most non-interactive deniable authentication protocols cannot resist weaken key-compromise impersonation (W-KCI) attack. To settle this problem, a non-interactive identity-based restricted deniable authentication protocol is proposed. It not only can resist W-KCI attack but also has the properties of communication flexibility. It meets the security requirements such as correctness, restricted deniability as well. Therefore, this protocol can be applied in electronic voting.
@article{jafari2009utility, title = {Utility proportional optimization flow control for overlay multicast}, author = {Jafari, Ali and Shafiei, Hosein and Mirzasoleiman, Baharan and Sepidnam, Ghodrat}, journal = {IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)}, pages = {401--407}, year = {2009}, organization = {IEEE}, }

Thesis

Big data summarization using submodular functions

Baharan Mirzasoleiman

ETH Zurich, 2017, Thesis

Bib PDF

@phdthesis{mirzasoleiman2017big,
  title = {Big data summarization using submodular functions},
  author = {Mirzasoleiman, Baharan},
  year = {2017},
  school = {ETH Zurich}
}