Training with larger mini-batches improves the convergence rate and can yield superior performance. However, training with large mini-batches becomes prohibitive for Large Language Models (LLMs), due to their large GPU memory requirements. To address this problem, an effective approach is finding small mini-batch coresets that closely match the gradient of larger mini-batches. This approach, however, becomes infeasible and ineffective for LLMs, due to the highly imbalanced nature of the sources in language data, the use of the Adam optimizer, and the very large gradient dimensionality of LLMs. In this work, we address the above challenges by proposing Coresets for Training LLMs (CoLM). First, we show that mini-batch coresets found by gradient matching do not contain representative examples of the small sources with high probability, and thus including all examples of the small sources in the mini-batch coresets is crucial for optimal performance. Second, we normalize the gradients by their historical exponential average to find mini-batch coresets for training with Adam. Finally, we leverage zeroth-order methods to find smooth gradients of the last V-projection matrix, and sparsify them to keep the dimensions with the largest normalized gradient magnitude. We apply CoLM to fine-tuning Phi-2, Phi-3, and Zephyr with LoRA on the MathInstruct and SuperGLUE benchmarks. Remarkably, CoLM reduces the memory requirement of fine-tuning by 2x and even outperforms training with 4x larger mini-batches. Notably, CoLM easily stacks with existing memory-efficient training methods, such as LoRA.
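To make the two selection ideas above concrete, here is a minimal Python sketch of Adam-aware mini-batch coreset selection. It is an illustration under stated assumptions, not the authors' implementation: the per-example proxy gradients, the greedy matching rule, and all names are hypothetical; only keeping all small-source examples and normalizing by Adam's historical average are taken from the abstract.

```python
# Hypothetical sketch of Adam-aware mini-batch coreset selection (CoLM-style).
import numpy as np

def select_coreset(grads, sources, v_ema, small_sources, k, eps=1e-8):
    """Pick k examples whose mean (Adam-normalized) gradient matches the batch.

    grads:         (n, d) per-example proxy gradients (e.g., last V-projection)
    sources:       length-n array of data-source ids
    v_ema:         (d,) exponential moving average of squared gradients (Adam's v)
    small_sources: set of source ids whose examples are always kept
    """
    # Normalize by Adam's preconditioner so matching reflects the update direction.
    g = grads / (np.sqrt(v_ema) + eps)
    target = g.mean(axis=0)                      # large-batch update direction

    # Keep all examples of small (under-represented) sources.
    selected = [i for i in range(len(g)) if sources[i] in small_sources]
    rest = [i for i in range(len(g)) if i not in selected]

    # Greedily add the example that best reduces the gradient-matching error.
    while len(selected) < k and rest:
        errs = [np.linalg.norm(target - g[selected + [j]].mean(axis=0)) for j in rest]
        selected.append(rest.pop(int(np.argmin(errs))))
    return selected

# Toy usage: 64 examples, 10-dim proxy gradients, source 7 is rare.
rng = np.random.default_rng(0)
grads = rng.normal(size=(64, 10))
sources = rng.integers(0, 8, size=64)
coreset = select_coreset(grads, sources, v_ema=grads.var(axis=0),
                         small_sources={7}, k=16)
print(len(coreset), "examples selected")
```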
ArXiv
Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks
Dataset distillation (DD) generates small synthetic datasets that can efficiently train deep networks with a limited amount of memory and compute. Despite the success of DD methods for supervised learning, DD for self-supervised learning (SSL) pre-training of deep models has remained unaddressed. Pre-training on unlabeled data is crucial for efficiently generalizing to downstream tasks with limited labeled data. In this work, we propose the first effective DD method for SSL pre-training. First, we show, theoretically and empirically, that naive application of supervised DD methods to SSL fails, due to the high variance of the SSL gradient. Then, we address this issue by relying on insights from the knowledge distillation (KD) literature. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. Then, we generate a small synthetic dataset by matching the training trajectories of the student models. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders. Through extensive experiments, we show that our distilled sets lead to up to 13% higher accuracy than prior work, on a variety of downstream tasks, in the presence of limited labeled data.
ArXiv
Towards Mitigating Spurious Correlations in the Wild: A Benchmark & a more Realistic Dataset
Deep neural networks often exploit non-predictive features that are spuriously correlated with class labels, leading to poor performance on groups of examples without such features. Despite the growing body of recent works on remedying spurious correlations, the lack of a standardized benchmark hinders reproducible evaluation and comparison of the proposed solutions. To address this, we present SpuCo, a Python package with modular implementations of state-of-the-art solutions enabling easy and reproducible evaluation of current methods. Using SpuCo, we demonstrate the limitations of existing datasets and evaluation schemes in validating the learning of predictive features over spurious ones. To overcome these limitations, we propose two new vision datasets: (1) SpuCoMNIST, a synthetic dataset that enables simulating the effect of real-world data properties, e.g., the difficulty of learning the spurious feature, as well as noise in the labels and features; (2) SpuCoAnimals, a large-scale dataset curated from ImageNet that captures spurious correlations in the wild much more closely than existing datasets. These contributions highlight the shortcomings of current methods and provide a direction for future research in tackling spurious correlations. SpuCo, containing the benchmark and datasets, can be found at https://github.com/BigML-CS-UCLA/SpuCo, with detailed documentation available at https://spuco.readthedocs.io/en/latest/.
Can we modify the training data distribution to encourage the underlying optimization method toward finding solutions with superior generalization performance on in-distribution data? In this work, we approach this question for the first time by comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM). By studying a two-layer CNN, we prove that SAM learns easy and difficult features more uniformly, particularly in early epochs. That is, SAM is less susceptible to simplicity bias compared to GD. Based on this observation, we propose USEFUL, an algorithm that clusters examples based on the network output early in training and upsamples examples with no easy features to alleviate the pitfalls of the simplicity bias. We show empirically that modifying the training data distribution in this way effectively improves the generalization performance on the original data distribution when training with (S)GD by mimicking the training dynamics of SAM. Notably, we demonstrate that our method can be combined with SAM and existing data augmentation strategies to achieve, to the best of our knowledge, state-of-the-art performance for training ResNet18 on CIFAR10, STL10, CINIC10, Tiny-ImageNet; ResNet34 on CIFAR100; and VGG19 and DenseNet121 on CIFAR10.
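A rough sketch of the clustering-and-upsampling step described above, under our own assumptions (two clusters per class, a mean-confidence heuristic, hypothetical names); this is for intuition only, not the released USEFUL code.

```python
# Hypothetical sketch: cluster each class by the model's early-epoch outputs
# and duplicate the low-confidence cluster, which lacks easy features.
import numpy as np
from sklearn.cluster import KMeans

def useful_upsample(early_probs, labels):
    """early_probs: (n, num_classes) softmax outputs after a few epochs;
    labels: integer class indices matching the columns of early_probs."""
    out = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            out.extend(idx.tolist())
            continue
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(early_probs[idx])
        # Mean confidence on the true class, per cluster.
        conf = [early_probs[idx][km.labels_ == g, c].mean() for g in (0, 1)]
        hard = idx[km.labels_ == int(np.argmin(conf))]
        out.extend(idx.tolist() + hard.tolist())   # upsample the hard cluster 2x
    return np.array(out)
```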
Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across six in- and out-of-domain evaluation datasets. Remarkably, selecting only 50K examples for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.
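As a rough illustration of trajectory-based selection, the sketch below clusters per-example loss trajectories recorded from a small reference model and samples evenly across clusters. The clustering choice, parameters, and names are our assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of S2L-style selection from small-model loss trajectories.
import numpy as np
from sklearn.cluster import KMeans

def s2l_select(loss_trajectories, budget, n_clusters=100, seed=0):
    """loss_trajectories: (n, t) losses of each example over t small-model checkpoints."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(loss_trajectories)
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        take = min(per_cluster, len(idx))   # balanced sampling across clusters
        selected.extend(rng.choice(idx, size=take, replace=False))
    return np.array(selected)
```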
Contrastive Language-Image Pre-training (CLIP) on large image-caption datasets has achieved remarkable success in zero-shot classification and enabled transferability to new domains. However, CLIP is far more vulnerable to targeted data poisoning and backdoor attacks than supervised models. Perhaps surprisingly, poisoning 0.0001% of CLIP pre-training data is enough to make targeted data poisoning attacks successful. This is four orders of magnitude smaller than what is required to poison supervised models. Despite this vulnerability, existing methods are very limited in defending CLIP models during pre-training. In this work, we propose a strong defense, SAFECLIP, to safely pre-train CLIP against targeted data poisoning and backdoor attacks. SAFECLIP warms up the model by applying unimodal contrastive learning (CL) on image and text modalities separately. Then, it divides the data into safe and risky sets by applying a Gaussian Mixture Model to the cosine similarity of image-caption pair representations. SAFECLIP pre-trains the model by applying the CLIP loss to the safe set and applying unimodal CL to the image and text modalities of the risky set separately. By gradually increasing the size of the safe set during pre-training, SAFECLIP effectively breaks targeted data poisoning and backdoor attacks without harming the CLIP performance. Our extensive experiments on CC3M, Visual Genome and MSCOCO demonstrate that SAFECLIP significantly reduces the success rate of targeted data poisoning attacks from 93.75% to 0% and that of various backdoor attacks from up to 100% to 0%, without harming CLIP’s performance.
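The safe/risky split described above can be illustrated with a short sketch; the quantile-based warm-up schedule and all names below are our assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of a SAFECLIP-style safe/risky split: fit a 2-component
# GMM to image-caption cosine similarities; the high-similarity component is safe.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_safe_risky(img_emb, txt_emb, safe_fraction=0.3):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = (img * txt).sum(axis=1)                    # cosine similarity per pair
    gmm = GaussianMixture(n_components=2, random_state=0).fit(sim[:, None])
    safe_comp = int(np.argmax(gmm.means_.ravel()))   # higher-mean component
    p_safe = gmm.predict_proba(sim[:, None])[:, safe_comp]
    # Start with a small safe set; safe_fraction would grow during pre-training.
    cutoff = np.quantile(p_safe, 1 - safe_fraction)
    return p_safe >= cutoff                          # boolean mask: True = safe
```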
Pretrained machine learning models need to be adapted to distribution shifts when deployed in new target environments. When obtaining labeled data from the target distribution is expensive, few-shot adaptation with only a few examples from the target distribution becomes essential. In this work, we propose MixPro, a lightweight and highly data-efficient approach for few-shot adaptation. MixPro first generates a relatively large dataset by mixing (linearly combining) pre-trained embeddings of large source data with those of the few target examples. This process preserves important features of both the source and target distributions, while mitigating the specific noise in the small target data. Then, it trains a linear classifier on the mixed embeddings to effectively adapt the model to the target distribution without overfitting the small target data. Theoretically, we demonstrate the advantages of MixPro over previous methods. Our experiments, conducted across various model architectures on 8 datasets featuring different types of distribution shifts, reveal that MixPro can outperform baselines by as much as 7%, with only 2-4 target examples.
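For intuition, a minimal sketch of the mixing step, assuming class-conditional pairing and a fixed mixing coefficient (both our own choices, not necessarily the paper's):

```python
# Hypothetical sketch of MixPro-style embedding mixing plus a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def mixpro_adapt(src_emb, src_y, tgt_emb, tgt_y, lam=0.7, seed=0):
    rng = np.random.default_rng(seed)
    mixed_x, mixed_y = [], []
    # Pair each source embedding with a random target example of the same class.
    for x, y in zip(src_emb, src_y):
        same = np.where(tgt_y == y)[0]
        if len(same) == 0:
            continue
        t = tgt_emb[rng.choice(same)]
        mixed_x.append(lam * t + (1 - lam) * x)      # convex combination
        mixed_y.append(y)
    # A linear probe on mixed embeddings adapts without overfitting the few shots.
    return LogisticRegression(max_iter=1000).fit(np.array(mixed_x), mixed_y)
```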
We present NeWRF, a deep learning framework for predicting wireless channels. Wireless channel prediction is a long-standing problem in the wireless community and is a key technology for improving the coverage of wireless network deployments. Today, a wireless deployment is evaluated by a site survey, which is a cumbersome process requiring an experienced engineer to perform extensive channel measurements. To reduce the cost of site surveys, we develop NeWRF, which is based on recent advances in Neural Radiance Fields (NeRF). NeWRF trains a neural network model with a sparse set of channel measurements, and predicts the wireless channel accurately at any location in the site. We introduce a series of techniques that integrate wireless propagation properties into the NeWRF framework to account for the fundamental differences between the behavior of light and wireless signals. We conduct extensive evaluations of our framework and show that our approach can accurately predict channels at unvisited locations with significantly lower measurement density than the prior state-of-the-art.
UAI
Graph Contrastive Learning under Heterophily via Graph Filters
Graph contrastive learning (CL) methods learn node representations in a self-supervised manner by maximizing the similarity between the augmented node representations obtained via a GNN-based encoder. However, CL methods perform poorly on graphs with heterophily, where connected nodes tend to belong to different classes. In this work, we address this problem by proposing an effective graph CL method, namely HLCL, for learning graph representations under heterophily. HLCL first identifies a homophilic and a heterophilic subgraph based on the cosine similarity of node features. It then uses a low-pass and a high-pass graph filter to aggregate representations of nodes connected in the homophilic subgraph and differentiate representations of nodes in the heterophilic subgraph. The final node representations are learned by contrasting both the augmented high-pass filtered views and the augmented low-pass filtered node views. Our extensive experiments show that HLCL outperforms state-of-the-art graph CL methods on benchmark datasets with heterophily, as well as large-scale real-world graphs, by up to 7%, and outperforms graph supervised learning methods on datasets with heterophily by up to 10%.
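A toy sketch of the filtering idea, under our own simplifications (dense adjacency, a zero cosine-similarity threshold, hypothetical names); not the paper's implementation:

```python
# Hypothetical HLCL-style views: split edges by feature cosine similarity, then
# low-pass filter the homophilic subgraph and high-pass filter the heterophilic one.
import numpy as np

def hlcl_views(A, X, thresh=0.0):
    """A: (n, n) dense adjacency; X: (n, d) node features."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                                   # feature cosine similarity
    A_hom = A * (S > thresh)                        # edges with similar endpoints
    A_het = A * (S <= thresh)                       # edges with dissimilar endpoints

    def norm_adj(M):
        d = M.sum(1)
        d[d == 0] = 1
        return M / d[:, None]                       # row-normalized adjacency

    low = norm_adj(A_hom) @ X                       # low-pass: smooth over edges
    high = X - norm_adj(A_het) @ X                  # high-pass: differentiate
    return low, high                                # two views to contrast
```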
UAI
Investigating the Impact of Model Width and Density on Generalization in Presence of Label Noise
Increasing the size of overparameterized neural networks has been key to achieving state-of-the-art performance. This is captured by the double descent phenomenon, where the test loss follows a decreasing-increasing-decreasing pattern (or sometimes monotonically decreasing) as model width increases. However, the effect of label noise on the test loss curve has not been fully explored. In this work, we uncover an intriguing phenomenon where label noise leads to a final ascent in the originally observed double descent curve. Specifically, under a sufficiently large noise-to-sample-size ratio, optimal generalization is achieved at intermediate widths. Through theoretical analysis, we attribute this phenomenon to the shape transition of test loss variance induced by label noise. Furthermore, we extend the final ascent phenomenon to model density and provide the first theoretical characterization showing that reducing density by randomly dropping trainable parameters improves generalization under label noise. We also thoroughly examine the roles of regularization and sample size. Surprisingly, we find that larger L2 regularization and robust learning methods against label noise exacerbate the final ascent. We confirm the validity of our findings through extensive experiments on ReLU networks trained on MNIST, ResNets/ViTs trained on CIFAR-10/100, and InceptionResNet-v2 trained on Stanford Cars with real-world noisy labels.
Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP’s performance than increasing its volume. Nevertheless, finding a subset of image-caption pairs that, when trained on, provably generalizes on par with the full data has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that best preserve the cross-covariance of the images and captions of the full data best preserve CLIP’s generalization performance. Our extensive experiments on ConceptualCaptions3M demonstrate that subsets of size 5%-10% found by ClipCov achieve over 150% and 40% of the accuracy of the next best baseline on ImageNet and its shifted versions, respectively. Moreover, we show that our subset exhibits an average relative performance improvement over the next best baseline of nearly 50% across 14 downstream datasets.
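The selection criterion stated above, preserving the cross-covariance of images and captions, can be made concrete with a toy greedy procedure. The rule below is a hypothetical instantiation for intuition only, not the paper's algorithm:

```python
# Toy greedy selection of a subset whose image-text cross-covariance matches
# that of the full data (illustrative and slow; names are hypothetical).
import numpy as np

def clipcov_subset(img_emb, txt_emb, k):
    n = len(img_emb)
    target = img_emb.T @ txt_emb / n              # full-data cross-covariance
    acc = np.zeros_like(target)                   # running subset cross-covariance
    selected, remaining = [], set(range(n))
    for _ in range(k):
        best_j, best_err = None, np.inf
        m = len(selected)
        for j in remaining:
            cand = (acc * m + np.outer(img_emb[j], txt_emb[j])) / (m + 1)
            err = np.linalg.norm(target - cand)   # Frobenius matching error
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.discard(best_j)
        acc = (acc * m + np.outer(img_emb[best_j], txt_emb[best_j])) / (m + 1)
    return selected
```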
Neural networks trained with (stochastic) gradient descent have an inductive bias towards learning simpler solutions. This makes them highly prone to learning spurious correlations in the training data that may not hold at test time. In this work, we provide the first theoretical analysis of the effect of simplicity bias on learning spurious correlations. Notably, we show that examples with spurious features are provably separable based on the model’s output early in training. We further illustrate that if spurious features have a small enough noise-to-signal ratio, the network’s output on the majority of examples is almost exclusively determined by the spurious features, leading to poor worst-group test accuracy. Finally, we propose Spare, which identifies spurious correlations early in training, and utilizes importance sampling to alleviate their effect. Empirically, we demonstrate that Spare outperforms state-of-the-art methods by up to 21.1% in worst-group accuracy, while being up to 12x faster. We also show that Spare is a highly effective but lightweight method to discover spurious correlations. Code is available at https://github.com/BigML-CS-UCLA/SPARE.
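A hypothetical sketch of Spare's two steps (the cluster count and weighting rule are our assumptions, not the released code): cluster each class by early model outputs, where spurious groups separate, then weight examples inversely to group size for importance sampling.

```python
# Illustrative Spare-style importance weights from early-training outputs.
import numpy as np
from sklearn.cluster import KMeans

def spare_weights(early_outputs, labels, n_groups=2):
    """early_outputs: (n, c) model outputs after a few epochs."""
    weights = np.zeros(len(labels))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        groups = KMeans(n_clusters=n_groups, n_init=10,
                        random_state=0).fit_predict(early_outputs[idx])
        sizes = np.bincount(groups, minlength=n_groups)
        # Weight each example inversely to its group size, so majority
        # (spurious) groups are downweighted during sampling.
        weights[idx] = 1.0 / sizes[groups]
    return weights / weights.sum()
```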
ICLR
Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift
Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP (Radford et al., 2021), have achieved remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite the empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL’s robustness: intra-class contrasting, which allows the model to learn features with a high variance, and inter-class feature sharing, where annotated details in one class help learning other classes better. Both mechanisms prevent spurious features that are over-represented in the training data from overshadowing the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment involving training CLIP models on MSCOCO (Lin et al., 2014)/Conceptual Captions (Sharma et al., 2018) and evaluating them on shifted ImageNets.
ICLR
Investigating the Benefits of Projection Head for Representation Learning
An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves the robustness in supervised contrastive learning and supervised learning. We empirically validate our results through various experiments on CIFAR-10/100, UrbanCars and shifted versions of ImageNet. We also introduce a potential alternative to projection head, which offers a more interpretable and controllable design.
ICLR
Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality
Dataset distillation aims to minimize the time and memory needed for training deep networks on large datasets, by creating a small set of synthetic images that has a similar generalization performance to that of the full dataset. However, current dataset distillation techniques fall short, showing a notable performance gap compared to training on the original data. In this work, we are the first to argue that the use of only one synthetic subset for distillation may not yield optimal generalization performance. This is because the training dynamics of deep networks change drastically during training. Therefore, multiple synthetic subsets are required to capture the dynamics of training in different stages. To address this issue, we propose Progressive Dataset Distillation (PDD). PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets without requiring additional training time. Our extensive experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%. In addition, our method for the first time enables generating considerably larger synthetic datasets. Our codes are available at https://github.com/VITA-Group/ProgressiveDD.
Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification, by learning from millions of image-caption pairs crawled from the internet. However, the massive data that powers large multimodal models such as CLIP makes them extremely vulnerable to various types of targeted data poisoning and backdoor attacks. Despite this vulnerability, robust contrastive vision-language pre-training against such attacks has remained unaddressed. In this work, we propose ROCLIP, the first effective method for robustly pre-training multimodal vision-language models against targeted data poisoning and backdoor attacks. ROCLIP effectively breaks the association between poisoned image-caption pairs by considering a relatively large and varying pool of random captions, and matching every image with the text that is most similar to it in the pool instead of its own caption, every few epochs. It also leverages image and text augmentations to further strengthen the defense and improve the performance of the model. Our extensive experiments show that ROCLIP renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training CLIP models. In particular, ROCLIP decreases the success rate of targeted data poisoning attacks from 93.75% to 12.5% and that of backdoor attacks down to 0%, while improving the model’s linear probe performance by 10% and maintaining a similar zero-shot performance compared to CLIP. By increasing the frequency of matching, ROCLIP is able to defend against strong attacks, which add up to 1% poisoned examples to the data, and successfully maintain a low attack success rate of 12.5%, while trading off the performance on some tasks.
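For intuition, a minimal sketch of the matching step, assuming precomputed embeddings and a hypothetical pool size (not the released code):

```python
# Illustrative ROCLIP-style matching: every few epochs, pair each image with
# the most similar caption from a random pool instead of its own caption,
# breaking poisoned image-caption associations.
import numpy as np

def roclip_targets(img_emb, txt_emb, pool_size=2048, seed=0):
    rng = np.random.default_rng(seed)
    pool = rng.choice(len(txt_emb), size=min(pool_size, len(txt_emb)),
                      replace=False)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb[pool] / np.linalg.norm(txt_emb[pool], axis=1, keepdims=True)
    nearest = (img @ txt.T).argmax(axis=1)     # most similar pooled caption
    return pool[nearest]                       # caption index to train against
```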
While deep learning models have shown remarkable performance in various tasks, they are susceptible to learning non-generalizable spurious features rather than the core features that are genuinely correlated with the true label. In this paper, going beyond existing analyses of linear models, we theoretically examine the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. In light of this, we propose a new training algorithm called PDE that efficiently enhances the model’s robustness for a better worst-group performance. PDE begins with a group-balanced subset of training data and progressively expands it to facilitate the learning of the core features. Experiments on synthetic and real-world benchmark datasets confirm the superior performance of our method on models such as ResNets and Transformers. On average, our method achieves a 2.8% improvement in worst-group accuracy compared with the state-of-the-art method, while enjoying up to 10x faster training efficiency.
J. Affect. Disord.
Sleep, Brain Systems, and Persistent Stress in Early Adolescents During COVID-19: Insights from the ABCD Study
Orsolya Kiss, Zihan Qu, Eva M. Müller-Oehring, Fiona C. Baker, and Baharan Mirzasoleiman
Purpose: The first year of the COVID-19 pandemic constituted a major life stress event for many adolescents, associated with disrupted school, behaviors, social networks, and health concerns. However, pandemic-related stress was not equivalent for everyone and could have been influenced by pre-pandemic factors including brain structure and sleep, which both undergo substantial development during adolescence. Here, we analyzed clusters of perceived stress levels across the pandemic and determined developmentally relevant pre-pandemic risk factors in brain structure and sleep of persistently high stress during the first year of the COVID-19 pandemic. Methods: We investigated longitudinal changes in perceived stress at six timepoints across the first year of the pandemic (May 2020–March 2021) in 5559 adolescents (50% female; age range: 11–14 years) in the United States (U.S.) participating in the Adolescent Brain Cognitive Development (ABCD) study. In 3141 of these adolescents, we fitted machine learning models to identify the most important pre-pandemic predictors from structural MRI brain measures and self-reported sleep data that were associated with persistently high stress across the first year of the pandemic. Results: Patterns of perceived stress levels varied across the pandemic, with 5% reporting persistently high stress. Our classifiers accurately detected persistently high stress (AUC > 0.7). Pre-pandemic brain structure, specifically cortical volume in temporal regions, and cortical thickness in multiple parietal and occipital regions, predicted persistent stress. Pre-pandemic sleep difficulties and short sleep duration were also strong predictors of persistent stress, along with more advanced pubertal stage. Conclusions: Adolescents showed variable stress responses during the first year of the COVID-19 pandemic, and some reported persistently high stress across the whole first year. Vulnerability to persistent stress was evident in several brain structural and self-reported sleep measures, collected before the pandemic, suggesting the relevance of other pre-existing individual factors beyond pandemic-related factors, for persistently high stress responses.
Spurious correlations that degrade model generalization or lead the model to be right for the wrong reasons are one of the main robustness concerns for real-world deployments. However, mitigating these correlations during pre-training for large-scale models can be costly and impractical, particularly for those without access to high-performance computing resources. This paper proposes a novel approach to address spurious correlations during fine-tuning for a given domain of interest. With a focus on multi-modal models (e.g., CLIP), the proposed method leverages different modalities in these models to detect and explicitly set apart spurious attributes from the affected class, achieved through a multi-modal contrastive loss function that expresses spurious relationships through language. Our experimental results and in-depth visualizations on CLIP show that such an intervention can effectively i) improve the model’s accuracy when spurious attributes are not present, and ii) direct the model’s activation maps towards the actual class rather than the spurious attribute when present. In particular, on the Waterbirds dataset, our algorithm achieved a worst-group accuracy 23% higher than ERM on CLIP with a ResNet-50 backbone, and 32% higher on CLIP with a ViT backbone, while maintaining the same average accuracy as ERM.
Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data. As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations. This enables efficient SSL by reducing the volume of data required for learning high-quality representations. Nevertheless, quantifying the value of examples for SSL has remained an open question. In this work, we address this for the first time by proving that examples that contribute the most to contrastive SSL are those that have the most similar augmentations to other examples, in expectation. We provide rigorous guarantees for the generalization performance of SSL on such subsets. Empirically, we discover, perhaps surprisingly, that the subsets that contribute the most to SSL are those that contribute the least to supervised learning. Through extensive experiments, we show that our subsets outperform random subsets by more than 3% on CIFAR100, CIFAR10, and STL10. Interestingly, we also find that we can safely exclude 20% of examples from CIFAR100 and 40% from STL10, without affecting downstream task performance.
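The valuation rule above, expected augmentation similarity, can be estimated with a few sampled augmentations. The sketch below is our own instantiation; `augment` is an assumed user-supplied function, not part of the paper's code.

```python
# Score each example by the average similarity of its augmented views to other
# examples' augmented views (a Monte Carlo estimate of the expectation).
import numpy as np

def augmentation_similarity_scores(x, augment, n_views=4, seed=0):
    """x: (n, d) examples; augment(x, rng) -> (n, d) augmented copies."""
    rng = np.random.default_rng(seed)
    views = np.stack([augment(x, rng) for _ in range(n_views)])  # (v, n, d)
    views /= np.linalg.norm(views, axis=-1, keepdims=True)
    flat = views.reshape(-1, views.shape[-1])                    # (v*n, d)
    sim = flat @ flat.T                                          # all-pairs cosine
    n = x.shape[0]
    # Average over view pairs to get an (n, n) expected-similarity matrix.
    per_example = sim.reshape(n_views, n, n_views, n).mean(axis=(0, 2))
    np.fill_diagonal(per_example, 0)              # exclude self-similarity
    return per_example.mean(axis=1)               # higher = more valuable for SSL
```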
Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. However, supervised CL is prone to collapsing representations of subclasses within a class by not capturing all their features, and unsupervised CL may suppress harder class-relevant features by focusing on learning easy class-irrelevant features; both significantly compromise representation quality. Yet, there is no theoretical understanding of class collapse or feature suppression at test time. We provide the first unified theoretically rigorous framework to determine which features are learnt by CL. Our analysis indicates that, perhaps surprisingly, the bias of (stochastic) gradient descent towards finding simpler solutions is a key factor in collapsing subclass representations and suppressing harder class-relevant features. Moreover, we present increasing embedding dimensionality and improving the quality of data augmentations as two theoretically motivated solutions to feature suppression. We also provide the first theoretical explanation for why employing supervised and unsupervised CL together yields higher-quality representations, even when using commonly-used stochastic gradient methods.
To improve the efficiency and sustainability of learning deep models, we propose CREST, the first scalable framework with rigorous theoretical guarantees to identify the most valuable examples for training non-convex models, particularly deep networks. To guarantee convergence to a stationary point of a non-convex function, CREST models the non-convex loss as a series of quadratic functions and extracts a coreset for each quadratic sub-region. In addition, to ensure faster convergence of stochastic gradient methods such as (mini-batch) SGD, CREST iteratively extracts multiple mini-batch coresets from larger random subsets of training data, to ensure nearly-unbiased gradients with small variances. Finally, to further improve scalability and efficiency, CREST identifies the examples that have already been learned and excludes them from the coreset selection pipeline. Our extensive experiments on several deep networks trained on vision and NLP datasets, including CIFAR-10, CIFAR-100, TinyImageNet, and SNLI, confirm that CREST speeds up training deep networks on very large datasets by 1.7x to 2.5x, with minimal loss in performance. By analyzing the learning difficulty of the subsets selected by CREST, we show that deep models benefit the most by learning from subsets of increasing difficulty levels.
HotStorage
NeSSA: Near-Storage Data Selection for Accelerated Machine Learning Training
Large-scale machine learning (ML) models rely on extremely large datasets to learn their exponentially growing number of parameters. While these models achieve unprecedented success, the increase in training time and hardware resources required is unsustainable. Further, we find that as dataset sizes increase, data movement becomes a significant component of overall training time. We propose NeSSA, a novel SmartSSD+GPU training architecture to intelligently select important subsets of large datasets near-storage, such that training on the subset mimics training on the full dataset with a very small loss in accuracy. To the best of our knowledge, this is the first work to propose such a near-storage data selection model for efficient ML training. We have evaluated our method for the CIFAR-10, SVHN, CINIC-10, CIFAR-100, TinyImageNet, and ImageNet-100 datasets. We also test across ResNet-20, ResNet-18, and ResNet-50 models.
We consider maximization of stochastic monotone continuous submodular functions (CSF) with a diminishing return property. Existing algorithms only guarantee the performance in expectation, and do not bound the probability of getting a bad solution. This implies that for a particular run of the algorithms, the solution may be much worse than the provided guarantee in expectation. In this paper, we first empirically verify that this is indeed the case. Then, we provide the first high-probability analysis of the existing methods for stochastic CSF maximization, namely PGA, boosted PGA, SCG, and SCG++. Finally, we provide an improved high-probability bound for SCG, under slightly stronger assumptions, with a better convergence rate than that of the expected solution. Through extensive experiments on non-concave quadratic programming (NQP) and optimal budget allocation, we confirm the validity of our bounds and show that even in the worst case, PGA converges to OPT/2, and boosted PGA, SCG, and SCG++ converge to (1-1/e)OPT, but at a slower rate than that of the expected solution.
ICDH
A Self-supervised Framework for Improved Data-Driven Monitoring of Stress via Multi-modal Passive Sensing
Recent advances in remote health monitoring systems have significantly benefited patients and played a crucial role in improving their quality of life. However, while physiological health-focused solutions have demonstrated increasing success and maturity, mental health-focused applications have seen comparatively limited success, in spite of the fact that stress and anxiety disorders are among the most common issues people deal with in their daily lives. In the hopes of furthering progress in this domain through the development of a more robust analytic framework for the measurement of indicators of mental health, we propose a multi-modal semi-supervised framework for tracking physiological precursors of the stress response. Our methodology enables utilizing multi-modal data of differing domains and resolutions from wearable devices and leveraging them to map short-term episodes to semantically efficient embeddings for a given task. Additionally, we leverage an inter-modality contrastive objective, with the advantages of rendering our framework both modular and scalable. The focus on optimizing both local and global aspects of our embeddings via a hierarchical structure makes transferring knowledge and compatibility with other devices easier to achieve. In our pipeline, a task-specific pooling based on an attention mechanism, which estimates the contribution of each modality on an instance level, computes the final embeddings for observations. This additionally provides a thorough diagnostic insight into the data characteristics and highlights the importance of signals in the broader view of predicting episodes annotated per mental health status. We perform training experiments using a corpus of real-world data on perceived stress, and our results demonstrate the efficacy of the proposed approach through performance improvements.
TKDE
On the fairness of time-critical influence maximization in social networks
Influence maximization has found applications in a wide range of real-world problems, for instance, viral marketing of products in an online social network, and propagation of valuable information such as job vacancy advertisements. While existing algorithmic techniques usually aim at maximizing the total number of people influenced, the population often comprises several socially salient groups, e.g., based on gender or race. As a result, these techniques could lead to disparity across different groups in receiving important information. Furthermore, in many applications, the spread of influence is time-critical, i.e., it is only beneficial to be influenced before a deadline. As we show in this paper, such time-criticality of information could further exacerbate the disparity of influence across groups. This disparity could have far-reaching consequences, impacting people’s prosperity and putting minority groups at a big disadvantage. In this work, we propose a notion of group fairness in time-critical influence maximization. We introduce surrogate objective functions to solve the influence maximization problem under fairness considerations. By exploiting the submodularity structure of our objectives, we provide computationally efficient algorithms with guarantees that are effective in enforcing fairness during the propagation process. Extensive experiments on synthetic and real-world datasets demonstrate the efficacy of our proposal.
A powerful category of (invisible) data poisoning attacks modify a subset of training examples by small adversarial perturbations to change the prediction of certain test-time data. Existing defense mechanisms are undesirable to deploy in practice, as they often either drastically harm the generalization performance, or are attack-specific and prohibitively slow to apply. Here, we propose a simple but highly effective approach that, unlike existing methods, breaks various types of invisible poisoning attacks with the slightest drop in the generalization performance. We make the key observation that attacks introduce local sharp regions of high training loss, which when minimized, results in learning the adversarial perturbations and makes the attack successful. To break poisoning attacks, our key idea is to alleviate the sharp loss regions introduced by poisons. To do so, our approach comprises two components: an optimized friendly noise that is generated to maximally perturb examples without degrading the performance, and a randomly varying noise component. The combination of both components builds a very lightweight but extremely effective defense against the most powerful triggerless targeted and hidden-trigger backdoor poisoning attacks, including Gradient Matching, Bulls-eye Polytope, and Sleeper Agent. We show that our friendly noise is transferable to other architectures, and adaptive attacks cannot break our defense due to its random noise component.
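A simplified PyTorch sketch of the optimized component; the objective below is an illustrative stand-in for the paper's formulation (all names are ours), and the random noise component is simply added at training time.

```python
# Grow the perturbation while a KL term keeps the model's predictions unchanged,
# flattening sharp loss regions introduced by poisons.
import torch
import torch.nn.functional as F

def friendly_noise(model, x, steps=20, eps=16 / 255, lr=0.1):
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    with torch.no_grad():
        clean = model(x).softmax(dim=1)
    for _ in range(steps):
        out = model(x + delta)
        # Stay predictive (small KL) while maximizing the noise magnitude.
        loss = F.kl_div(out.log_softmax(dim=1), clean,
                        reduction="batchmean") - delta.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)       # keep the friendly noise bounded
    return delta.detach()
```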
Data augmentation is essential to achieve state-of-the-art performance in many deep learning applications. However, the most effective augmentation techniques become computationally prohibitive for even medium-sized datasets. To address this, we propose a rigorous technique to select subsets of data points that when augmented, closely capture the training dynamics of full data augmentation. We first show that data augmentation, modeled as additive perturbations, improves learning and generalization by relatively enlarging and perturbing the smaller singular values of the network Jacobian, while preserving its prominent directions. This prevents overfitting and enhances learning the harder-to-learn information. Then, we propose a framework to iteratively extract small subsets of training data that when augmented, closely capture the alignment of the fully augmented Jacobian with labels/residuals. We prove that stochastic gradient descent applied to the augmented subsets found by our approach has similar training dynamics to that of fully augmented data. Our experiments demonstrate that our method achieves 6.3x speedup on CIFAR10 and 2.2x speedup on SVHN, and outperforms the baselines by up to 10% across various subset sizes. Similarly, on TinyImageNet and ImageNet, our method beats the baselines by up to 8%, while achieving up to 3.3x speedup across various subset sizes. Finally, training on and augmenting 50% subsets using our method on a version of CIFAR10 corrupted with label noise even outperforms using the full dataset.
Data poisoning causes misclassification of test-time target examples, by injecting maliciously crafted samples in the training data. Existing defenses are often effective only against a specific type of targeted attack, significantly degrade the generalization performance, or are prohibitive for standard deep learning pipelines. In this work, we propose an efficient defense mechanism that significantly reduces the success rate of various data poisoning attacks, and provides theoretical guarantees for the performance of the model. Targeted attacks work by adding bounded perturbations to a randomly selected subset of training data to match the targets’ gradient or representation. We show that: (i) under bounded perturbations, only a number of poisons can be optimized to have a gradient that is close enough to that of the target and make the attack successful; (ii) such effective poisons move away from their original class and get isolated in the gradient space; (iii) dropping examples in low-density gradient regions during training can successfully eliminate the effective poisons, and guarantees similar training dynamics to that of training on full data. Our extensive experiments show that our method significantly decreases the success rate of state-of-the-art targeted attacks, including Gradient Matching and Bullseye Polytope, and easily scales to large datasets.
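Point (iii) suggests a simple density-based filter. The sketch below is a hypothetical instantiation using k-nearest-neighbor distances in gradient space, not the paper's exact procedure.

```python
# Score each example by the distance to its k-th nearest neighbor in gradient
# space (large = isolated) and drop the most isolated ones.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def drop_isolated_gradients(grads, k=5, drop_frac=0.05):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(grads)
    dists, _ = nn.kneighbors(grads)            # column 0 is the point itself
    isolation = dists[:, -1]                   # distance to the k-th neighbor
    keep = isolation.argsort()[: int(len(grads) * (1 - drop_frac))]
    return np.sort(keep)                       # indices of examples to train on
```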
Training machine learning models on massive datasets incurs substantial computational costs. To alleviate such costs, there has been a sustained effort to develop data-efficient training methods that can carefully select subsets of the training examples that generalize on par with the full training data. However, existing methods are limited in providing theoretical guarantees for the quality of the models trained on the extracted subsets, and may perform poorly in practice. We propose AdaCore, a method that leverages the geometry of the data to extract subsets of the training examples for efficient machine learning. The key idea behind our method is to dynamically approximate the curvature of the loss function via an exponentially-averaged estimate of the Hessian to select weighted subsets (coresets) that provide a close approximation of the full gradient preconditioned with the Hessian. We prove rigorous guarantees for the convergence of various first and second-order methods applied to the subsets chosen by AdaCore. Our extensive experiments show that AdaCore extracts coresets with higher quality compared to baselines and speeds up training of convex and non-convex machine learning models, such as logistic regression and neural networks, by over 2.9x over the full data and 4.5x over random subsets.
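A toy sketch of the selection rule, assuming per-example gradients and a smoothed diagonal Hessian estimate are available; the greedy medoid cover below is our illustrative stand-in for the paper's selection objective, not the AdaCore code.

```python
# Precondition per-example gradients with an EMA-smoothed diagonal Hessian,
# then greedily pick weighted medoids that cover the preconditioned gradients.
import numpy as np

def adacore_select(grads, hess_diag_ema, k, eps=1e-8):
    pg = grads / (hess_diag_ema + eps)                 # preconditioned gradients
    d = np.linalg.norm(pg[:, None, :] - pg[None, :, :], axis=-1)
    selected, cover = [], np.full(len(pg), np.inf)
    for _ in range(k):                                 # greedy k-medoid cover
        totals = np.minimum(cover[None, :], d).sum(axis=1)
        best = int(np.argmin(totals))
        selected.append(best)
        cover = np.minimum(cover, d[best])
    # Weight each medoid by the number of examples it represents.
    weights = np.bincount(d[selected].argmin(axis=0), minlength=k)
    return selected, weights
```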
Self-supervised Contrastive Learning (CL) has recently been shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness, by having: (i) one prominent singular value corresponding to each sub-class in the data, and significantly smaller remaining singular values; and (ii) a large alignment between the prominent singular vectors and the clean labels of each sub-class. The above properties enable a linear layer trained on such representations to effectively learn the clean labels without overfitting the noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve a superior performance initially, when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels, e.g., an average of 27.18% and 15.58% increase in accuracy on CIFAR-10 and CIFAR-100 with 80% symmetric noisy labels, and 4.11% increase in accuracy on WebVision.
Syn.Data4ML
Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization
Generative models have the ability to synthesize data points drawn from the data distribution; however, not all generated samples are high quality. In this paper, we propose using a combination of coreset selection methods and “entropic regularization” to select the highest fidelity samples. We leverage an Energy-Based Model which resembles a variational auto-encoder with an inference and generator model, for which the latent prior is complexified by an energy-based model. In a semi-supervised learning scenario, we show that augmenting the labeled dataset by adding our selected subset of samples leads to a greater accuracy improvement than using all the synthetic samples.
BIBM
Passive Monitoring of Physiological Precursors of Stress Leveraging Smartwatch Data
Developing the capability to continuously and noninvasively monitor the mental health status of individuals is a critical focus in the mHealth domain. The use of passively generated data gathered via smart and portable electronic devices to monitor specific indicators of mental health has shown potential to serve as an effective alternative to traditional intrusive survey-based approaches to monitoring mental health remotely. In this study, we propose a remote health monitoring framework for dynamic, flexible, and scalable assessment and detection of physiological precursors of a stress response. Our method comprises a smartwatch-based system for continuous monitoring of primary physiological signals, followed by a deep neural network architecture that performs the fusion and processing of the multi-modal sensor readings. We empirically validate our system on a cohort of university-affiliated members of the military. Our findings demonstrate the effectiveness of our passive-sensing system for tracking perceived stress, the results of which can be used to obtain a better understanding of patient behavior and improve personalized treatments.
EAAMO
Towards Balanced Information Propagation in Social Media
As people increasingly rely on social media platforms such as Twitter to consume information, there are significant concerns about the diversity of news consumption. Users may narrow their attention to posts which reinforce their pre-existing views, which could lead to a more fragmented society. Aiming to combat this, earlier work divided news on a given story into high consensus and low consensus posts, based on how similar reactions can be expected from users with different political views: high consensus news elicits similar reactions, whereas low consensus news elicits different reactions from readers depending on their political leanings. In this work, we propose and quantify the benefits of a strategy to spread high consensus news across readers with diverse political leanings. We first compile a dataset and make the following three key observations: (1) low consensus news is more likely to remain within subgroups of users with similar political leanings, whereas high consensus news spreads more across subgroups; (2) high consensus news posted by neutral publishers spreads more equally across subgroups; and (3) users that get the information from other users instead of the publishers get an even more biased exposure to news. Then, we propose a strategy that spreads high consensus news through neutral publishers, and quantify the significant decrease in the disparity of users’ news exposure. Our extensive experiments on Twitter show that seeding high consensus information with neutral publishers is an effective way to achieve high spread with little disparity regarding political leaning.
CompBio
Purification of single-cell transcriptomics data with coreset selection
Despite the overall success of single-cell transcriptomics, variations in the number of cells captured from biological replicates in different regions of the embedding space of cells limit the interpretation of downstream computational analyses. Here we introduce a coreset selection based purification method to alleviate potential replicate-specific biases within single-cell datasets. We first identify regions of the embedding space of cells that are not biased towards single biological replicates, and then extract a representative cell subset (coreset) covering them. We demonstrate that the extracted coresets provide a solid ground for downstream analyses. Specifically, we show that differential gene expression signatures based on purified datasets are robust against replicate-specific biases across 24 different cell-type-specific single-cell datasets. Furthermore, we highlight that purification can enhance supervised learning from single-cell transcriptomics data. Our results indicate substantial improvement in predictive performance (up to 0.16 gain in AUC) when testing logistic regression models on 8 cell-type-specific datasets across two independent cohorts.
Dynamic evolving networks capture temporal relations in domains such as social networks, communication networks, and financial transaction networks. In such networks, temporal motifs, which are repeated sequences of time-stamped edges/transactions, offer valuable information about the networks’ evolution and function. However, calculating temporal motif frequencies is computationally expensive, as it requires (i) identifying all instances of the static motifs in the static graph induced by the temporal graph, and (ii) counting the number of subsequences of temporal edges that correspond to a temporal motif and occur within a time window. Since the number of temporal motifs changes over time, finding interesting temporal patterns involves iterative application of the above process over many consecutive time windows. This makes it impractical to scale to large real temporal networks. Here, we develop a fast and accurate model-based method for counting motifs in temporal networks. We first develop the Temporal Activity State Block Model (TASBM) to model temporal motifs in temporal graphs. Then we derive closed-form analytical expressions that allow us to quickly calculate expected motif frequencies and their variances in a given temporal network. Finally, we develop an efficient model fitting method, so that for a given network, we quickly fit the TASBM model and compute motif frequencies. We apply our approach to two real-world networks: a network of financial transactions and an email network. Experiments show that our TASBM framework (1) accurately counts temporal motifs in temporal networks; (2) easily scales to networks with tens of millions of edges/transactions; and (3) is about 50x faster than explicit motif counting methods on networks of about 5 million temporal edges, a factor that increases with network size.
Neural networks have become very widespread due to the mainstream availability of computational devices such as GPUs, and as these devices become more powerful, these networks have grown much larger. With the growing demand for fast, efficient networks, weight pruning has become a popular technique for reducing both the size and computational cost of these networks, but it introduces sparse matrices, which can be tedious to implement properly. In this paper, we investigate a different approach to model pruning involving low-rank decompositions and output perturbation.
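As a minimal illustration of the low-rank alternative, a truncated SVD replaces one dense weight matrix with two smaller factors and needs no sparse-matrix support (a generic sketch, not this paper's specific method):

```python
# Compress a weight matrix with a truncated SVD: W (out x in) is replaced by
# two factors A (out x r) and B (r x in), so the layer becomes two smaller ones.
import numpy as np

def low_rank_factors(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # (out, rank)
    B = Vt[:rank]                     # (rank, in)
    return A, B                       # W is approximately A @ B

W = np.random.randn(512, 256)
A, B = low_rank_factors(W, rank=32)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # relative approximation error
```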
The potential for machine learning systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. Much recent work has focused on developing algorithmic tools to assess and mitigate such unfairness. However, there is little work on enhancing fairness in graph algorithms. Here, we develop a simple, effective and general method, CrossWalk, that enhances fairness of various graph algorithms, including influence maximization, link prediction and node classification, applied to node embeddings. CrossWalk is applicable to any random walk based node representation learning algorithm, such as DeepWalk and Node2Vec. The key idea is to bias random walks to cross group boundaries, by upweighting edges which (1) are closer to the groups’ peripheries or (2) connect different groups in the network. CrossWalk pulls nodes that are near groups’ peripheries towards their neighbors from other groups in the embedding space, while preserving the necessary structural properties of the graph. Extensive experiments show the effectiveness of our algorithm to enhance fairness in various graph algorithms, including influence maximization, link prediction and node classification in synthetic and real networks, with only a very small decrease in performance.
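A hypothetical sketch of the reweighting step; the exact weighting formula and parameters below are illustrative stand-ins, not the paper's:

```python
# Boost edges that cross group boundaries or sit near a group's periphery,
# then run standard random walks (DeepWalk/Node2Vec) over the reweighted graph.
import networkx as nx

def crosswalk_reweight(G, group, alpha=0.5):
    """group: dict mapping node -> group id."""
    # Proximity to the periphery: fraction of neighbors in a different group.
    boundary = {v: sum(group[u] != group[v] for u in G[v]) / max(G.degree(v), 1)
                for v in G}
    for u, v in G.edges():
        w = G[u][v].get("weight", 1.0)
        if group[u] != group[v]:
            G[u][v]["weight"] = w * (1 + alpha)          # cross-group edge
        else:
            G[u][v]["weight"] = w * (1 + alpha * (boundary[u] + boundary[v]) / 2)
    return G

# Toy usage on a small graph split into two groups.
G = nx.karate_club_graph()
group = {v: int(v < 17) for v in G}
crosswalk_reweight(G, group)
```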
ICDE
On the fairness of time-critical influence maximization in social networks
Influence maximization has found applications in a wide range of real-world problems, for instance, viral marketing of products in an online social network, and propagation of valuable information such as job vacancy advertisements. While existing algorithmic techniques usually aim at maximizing the total number of people influenced, the population often comprises several socially salient groups, e.g., based on gender or race. As a result, these techniques could lead to disparity across different groups in receiving important information. Furthermore, in many applications, the spread of influence is time-critical, i.e., it is only beneficial to be influenced before a deadline. As we show in this paper, such time-criticality of information could further exacerbate the disparity of influence across groups. This disparity could have far-reaching consequences, impacting people’s prosperity and putting minority groups at a big disadvantage. In this work, we propose a notion of group fairness in time-critical influence maximization. We introduce surrogate objective functions to solve the influence maximization problem under fairness considerations. By exploiting the submodularity structure of our objectives, we provide computationally efficient algorithms with guarantees that are effective in enforcing fairness during the propagation process. Extensive experiments on synthetic and real-world datasets demonstrate the efficacy of our proposal.
2020
UAI
Coresets for estimating means and mean square error with limited greedy samples
In a number of situations, collecting a function value for every data point may be prohibitively expensive, and random sampling ignores any structure in the underlying data. We introduce a scalable optimization algorithm, a variant of gradient ascent for coreset selection in graphs that requires no correction steps (in contrast to Frank–Wolfe and its variants), which greedily selects a weighted subset of vertices that are deemed most important to sample. Our algorithm estimates the mean of the function by taking a weighted sum only at these vertices, and we provably bound the estimation error in terms of the location and weights of the selected vertices in the graph. In addition, we consider the case where nodes have different selection costs and provide bounds on the quality of the low-cost selected coresets. We demonstrate the benefits of our algorithm on semi-supervised node classification with graph convolutional neural networks, on point clouds and structured graphs, as well as on sensor placement, where the cost of placing sensors depends on the location of the placement. We also show that the empirical convergence of our proposed method is faster than random selection and various clustering methods, while still respecting the sensor placement cost. The paper concludes with validation of the developed algorithm on both synthetic and real datasets, demonstrating that it outperforms the current state of the art.
Incremental gradient (IG) methods, such as stochastic gradient descent and its variants, are commonly used for large-scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open question how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near) optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine learning models. Our extensive set of experiments show that CRAIG, while achieving practically the same solution, speeds up various IG methods by up to 6x for logistic regression and 3x for training deep neural networks.
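A compact sketch of the selection step, cast as greedy facility-location maximization over gradient similarities; this is an illustrative instantiation of the idea, not the authors' code.

```python
# Greedily maximize a facility-location function over pairwise gradient
# similarities; weight each selected example by the cluster it represents.
import numpy as np

def craig_coreset(grads, k):
    sim = grads @ grads.T
    sim -= sim.min()                      # shift so similarities are nonnegative
    n = len(grads)
    best = np.zeros(n)                    # current best similarity per example
    selected = []
    for _ in range(k):
        # Marginal gain of each candidate j: improvement in total coverage.
        gains = np.maximum(sim, best[None, :]).sum(axis=1) - best.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[j])
    # Weight = number of examples whose closest selected gradient is this one.
    weights = np.bincount(sim[selected].argmax(axis=0), minlength=k)
    return selected, weights              # run weighted IG/SGD on this coreset
```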
Modern neural networks have the capacity to overfit noisy labels frequently found in real-world datasets. Although great progress has been made, existing techniques are very limited in providing theoretical guarantees for the performance of the neural networks trained with noisy labels. To tackle this challenge, we propose a novel approach with strong theoretical guarantees for robust training of neural networks trained with noisy labels. The key idea behind our method is to select subsets of clean data points that provide an approximately low-rank Jacobian matrix. We then prove that gradient descent applied to the subsets cannot overfit the noisy labels, without regularization or early stopping. Our extensive experiments corroborate our theory and demonstrate that deep networks trained on our subsets achieve a significantly superior performance, e.g., 7% increase in accuracy on mini-WebVision with 50% noisy labels, compared to the state-of-the-art.
ICLR
Selection via Proxy: Efficient Data Selection for Deep Learning
Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia
International Conference on Learning Representations (ICLR), 2020
Data selection methods, such as active learning and core-set selection, are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because they depend on feature representations that need to be learned. In this work, we show that we can greatly improve the computational efficiency by using a small proxy model to perform data selection (e.g., selecting data points to label for active learning). By removing hidden layers from the target model, using smaller architectures, and training for fewer epochs, we create proxies that are an order of magnitude faster to train. Although these small proxy models have higher error rates, we find that they empirically provide useful signals for data selection. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks across five datasets: CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity, and Amazon Review Full. For active learning, applying SVP can give an order of magnitude improvement in data selection runtime (i.e., the time it takes to repeatedly train and select points) without significantly increasing the final error (often within 0.1%). For core-set selection on CIFAR10, proxies that are over 10x faster to train than their larger, more accurate targets can remove up to 50% of the data without harming the final accuracy of the target, leading to a 1.6x end-to-end training time improvement.
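A minimal sketch of the proxy idea applied to active learning, under simplifying assumptions: the proxy is any cheap classifier exposing an sklearn-style predict_proba, and unlabeled points are ranked by least-confidence uncertainty. The exact selection rule varies across tasks in the paper; this is only one common instantiation, and the names are ours.

```python
# A minimal sketch of "selection via proxy" for active learning: train a
# small, fast proxy model, then use its uncertainty to decide which points
# to label for the large target model. `proxy` is assumed to expose an
# sklearn-style predict_proba; least-confidence is one standard criterion.
import numpy as np

def select_via_proxy(proxy, X_unlabeled: np.ndarray, budget: int):
    probs = proxy.predict_proba(X_unlabeled)   # shape (n, num_classes)
    uncertainty = 1.0 - probs.max(axis=1)      # least-confidence score
    return np.argsort(-uncertainty)[:budget]   # most uncertain first
```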
The need for real-time analysis of rapidly produced data streams (e.g., video and image streams) motivated the design of streaming algorithms that can efficiently extract and summarize useful information from massive data "on the fly." Such problems can often be reduced to maximizing a submodular set function subject to various constraints. While efficient streaming methods have recently been developed for monotone submodular maximization, in a wide range of applications, such as video summarization, the underlying utility function is non-monotone, and there are often various constraints imposed on the optimization problem to account for privacy or personalization. We develop the first efficient single-pass streaming algorithm, Streaming Local Search, that, for any streaming monotone submodular maximization algorithm with approximation guarantee α under a collection of independence systems I, provides a constant 1/(1 + 2/√α + 1/α + 2d(1 + √α)) approximation guarantee for maximizing a non-monotone submodular function under the intersection of I and d knapsack constraints. Our experiments show that for video summarization, our method runs more than 1700 times faster than previous work, while maintaining practically the same performance.
Can evolving networks be inferred and modeled without directly observing their nodes and edges? In many applications, the edges of a dynamic network might not be observed, but one can observe the dynamics of stochastic cascading processes (e.g., information diffusion, virus propagation) occurring over the unobserved network. While there have been efforts to infer networks based on such data, providing a generative probabilistic model that is able to identify the underlying time-varying network remains an open question. Here we consider the problem of inferring generative dynamic network models based on network cascade diffusion data. We propose a novel framework that provides a non-parametric dynamic network model, based on a mixture of coupled hierarchical Dirichlet processes, from data capturing cascade node infection times. Our approach allows us to infer the evolving community structure in networks and to obtain an explicit predictive distribution over the edges of the underlying network, including those that were not involved in the transmission of any cascade or are likely to appear in the future. We show the effectiveness of our approach using extensive experiments on synthetic as well as real-world networks.
How can we summarize a dynamic data stream when elements selected for the summary can be deleted at any time? This is an important challenge in online services, where the users generating the data may decide to exercise their right to restrict the service provider from using (part of) their data due to privacy concerns. Motivated by this challenge, we introduce the dynamic deletion-robust submodular maximization problem. We develop the first resilient streaming algorithm, called ROBUST-STREAMING, with a constant factor approximation guarantee to the optimum solution. We evaluate the effectiveness of our approach on several real-world applications, including summarizing (1) streams of geo-coordinates; (2) streams of images; and (3) click-stream log data, consisting of 45 million feature vectors from a news recommendation task.
Submodular continuous functions are a category of (generally) non-convex/non-concave functions with a wide spectrum of applications. We characterize these functions and demonstrate that they can be maximized efficiently with approximation guarantees. Specifically, (i) we introduce the weak DR property that gives a unified characterization of submodularity for all set, integer-lattice, and continuous functions; (ii) for maximizing monotone DR-submodular continuous functions under general down-closed convex constraints, we propose a Frank-Wolfe variant with a (1 − 1/e) approximation guarantee and sub-linear convergence rate; (iii) for maximizing general non-monotone submodular continuous functions subject to box constraints, we propose a DoubleGreedy algorithm with a 1/3 approximation guarantee. Submodular continuous functions naturally find applications in various real-world settings, including influence and revenue maximization with continuous assignments, sensor energy management, facility location, etc. Experimental results show that the proposed algorithms efficiently generate superior solutions compared to baseline algorithms.
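To illustrate the Frank-Wolfe variant in (ii), here is a sketch under strong simplifying assumptions: the down-closed constraint set is a box [0, b]^d intersected with an l1 budget, so each linear subproblem has a closed-form greedy solution. grad_F is a user-supplied gradient oracle, and all names are ours, not from the paper.

```python
# A minimal sketch of a Frank-Wolfe-style method for monotone DR-submodular
# maximization: repeatedly solve a linear problem over the constraint set
# and move a 1/T step in that direction (the iterate only grows, with no
# correction steps). Assumes the constraint set is {0 <= x <= b, sum x <= B}.
import numpy as np

def dr_submodular_frank_wolfe(grad_F, d, b, budget, T=100):
    x = np.zeros(d)
    for _ in range(T):
        g = grad_F(x)
        # Linear maximization over the down-closed polytope: put mass on
        # coordinates with the largest positive gradient, up to the budget.
        v = np.zeros(d)
        spend = 0.0
        for i in np.argsort(-g):
            if g[i] <= 0 or spend >= budget:
                break
            v[i] = min(b, budget - spend)
            spend += v[i]
        x += v / T  # average of feasible directions stays feasible
    return x
```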
We consider the problem of learning sparse representations of data sets, where the goal is to reduce a data set in a manner that optimizes multiple objectives. Motivated by applications of data summarization, we develop a new model which we refer to as the two-stage submodular maximization problem. This task can be viewed as a combinatorial analogue of representation learning problems such as dictionary learning and sparse regression. The two-stage problem strictly generalizes the problem of cardinality-constrained submodular maximization, though the objective function is not submodular and the techniques for submodular maximization cannot be applied. We describe a continuous optimization method which achieves an approximation ratio that asymptotically approaches 1 − 1/e. For instances where the asymptotics do not kick in, we design a local-search algorithm whose approximation ratio is arbitrarily close to 1/2. We empirically demonstrate the effectiveness of our methods on two multi-objective data summarization tasks, where the goal is to construct summaries via sparse representative subsets with respect to predefined objectives.
Can we summarize multi-category data based on user preferences in a scalable manner? Many utility functions used for data summarization satisfy submodularity, a natural diminishing returns property. We cast personalized data summarization as an instance of a general submodular maximization problem subject to multiple constraints. We develop the first practical and FAst coNsTrained submOdular Maximization algorithm, FANTOM, with strong theoretical guarantees. FANTOM maximizes a submodular function (not necessarily monotone) subject to the intersection of a p-system and l knapsack constraints. It achieves a (1 + ε)(p + 1)(2p + 2l + 1)/p approximation guarantee with only O(nrp log(n)/ε) query complexity (n and r indicate the size of the ground set and the size of the largest feasible solution, respectively). We then show how we can use FANTOM for personalized data summarization. In particular, a p-system can model different aspects of data, such as categories or time stamps, from which the users choose. In addition, knapsacks encode users' constraints, including budget or time. In our set of experiments, we consider several concrete applications: movie recommendation over 11K movies, personalized image summarization with 10K images, and revenue maximization on the YouTube social network with 5000 communities. We observe that FANTOM consistently provides the highest utility against all the baselines.
In this paper, we introduce the public-private framework of data summarization, motivated by privacy concerns in personalized recommender systems and online social services. Such systems usually have access to massive data generated by a large pool of users. A major fraction of the data is public and is visible to (and can be used for) all users. However, each user can also contribute some private data that should not be shared with other users, to ensure her privacy. The goal is to provide a succinct summary of the massive dataset, ideally as small as possible, from which customized summaries can be built for each user, i.e., summaries that can contain elements from the public data (for diversity) and the user's private data (for personalization). To formalize the above challenge, we assume that the scoring function according to which a user evaluates the utility of her summary satisfies submodularity, a widely used notion in data summarization applications. Thus, we model the data summarization targeted to each user as an instance of a submodular cover problem. However, when the data is massive, it is infeasible to use the centralized greedy algorithm to find a customized summary even for a single user. Moreover, for a large pool of users, it is too time-consuming to find such summaries separately. Instead, we develop a fast distributed algorithm for submodular cover, FASTCOVER, that provides a succinct summary in one shot and for all users. We show that the solution provided by FASTCOVER is competitive with that of the centralized algorithm, while requiring a number of rounds that is exponentially smaller than state-of-the-art results. Moreover, we have implemented FASTCOVER with Spark to demonstrate its practical performance on a number of concrete applications, including personalized location recommendation, personalized movie recommendation, and dominating set, on tens of millions of data points and varying numbers of users.
Many large-scale machine learning problems (clustering, non-parametric learning, kernel machines, etc.) require selecting a small yet representative subset from a large dataset. Such problems can often be reduced to maximizing a submodular set function subject to various constraints. Classical approaches to submodular optimization require centralized access to the full dataset, which is impractical for truly large-scale problems. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol GreeDi, that is easily implemented using MapReduce style computations. We theoretically analyze our approach, and show that under certain natural conditions, performance close to the centralized approach can be achieved. We begin with monotone submodular maximization subject to a cardinality constraint, and then extend this approach to obtain approximation guarantees for (not necessarily monotone) submodular maximization subject to more general constraints, including matroid or knapsack constraints. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering on tens of millions of examples using Hadoop.
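A minimal single-machine sketch of the two-stage protocol, assuming the dataset fits in memory and "machines" are simulated by list partitions; f is a monotone submodular callable over sets, and the function names are ours.

```python
# A minimal sketch of a two-stage distributed greedy protocol: partition
# the data, run classical greedy on each part, merge the local solutions,
# run greedy once more on the union, and return the best set found.
def greedy(f, V, k):
    S = set()
    for _ in range(min(k, len(V))):
        best = max((e for e in V if e not in S),
                   key=lambda e: f(S | {e}) - f(S))
        S.add(best)
    return S

def two_stage_greedy(f, data, k, m):
    parts = [data[i::m] for i in range(m)]          # stage 1: partition
    local = [greedy(f, part, k) for part in parts]  # local greedy (parallelizable)
    merged = list(set().union(*local))              # stage 2: merge
    final = greedy(f, merged, k)                    # greedy on the union
    # Return the better of the final set and the best local solution.
    return max(local + [final], key=f)
```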
Is it possible to maximize a monotone submodular function faster than the widely used lazy greedy algorithm (also known as accelerated greedy), both in theory and practice? In this paper, we develop the first linear-time algorithm for maximizing a general monotone submodular function subject to a cardinality constraint. We show that our randomized algorithm, STOCHASTIC-GREEDY, can achieve a (1 − 1/e − ε) approximation guarantee, in expectation, to the optimum solution in time linear in the size of the data and independent of the cardinality constraint. We empirically demonstrate the effectiveness of our algorithm on submodular functions arising in data summarization, including training large-scale kernel methods, exemplar-based clustering, and sensor placement. We observe that STOCHASTIC-GREEDY practically achieves the same utility value as lazy greedy but runs much faster. More surprisingly, we observe that in many practical scenarios STOCHASTIC-GREEDY does not even evaluate every data point once and still achieves results indistinguishable from lazy greedy.
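The algorithm itself is short enough to sketch directly. The sample size per step follows the (n/k) log(1/ε) rule from the paper; f is assumed to be a monotone submodular callable over sets, and the variable names are ours.

```python
# A minimal sketch of STOCHASTIC-GREEDY: at each of the k steps, sample a
# small random candidate pool and add the sampled element with the largest
# marginal gain, instead of scanning the whole ground set.
import math
import random

def stochastic_greedy(f, ground_set, k, eps=0.1, seed=0):
    rng = random.Random(seed)
    n = len(ground_set)
    s = max(1, math.ceil((n / k) * math.log(1.0 / eps)))  # pool size per step
    S = set()
    for _ in range(k):
        pool = [e for e in ground_set if e not in S]
        candidates = rng.sample(pool, min(s, len(pool)))
        base = f(S)
        # Pick the sampled element with the largest marginal gain.
        best = max(candidates, key=lambda e: f(S | {e}) - base)
        S.add(best)
    return S
```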
How can one find a subset, ideally as small as possible, that well represents a massive dataset? I.e., its corresponding utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. In this paper, we formalize this challenge as a submodular cover problem. Here, the utility is assumed to exhibit submodularity, a natural diminishing returns condition prevalent in many data summarization applications. The classical greedy algorithm is known to provide solutions with logarithmic approximation guarantees compared to the optimum solution. However, this sequential, centralized approach is impractical for truly large-scale problems. In this work, we develop the first distributed algorithm, DISCOVER, for submodular set cover that is easily implementable using MapReduce-style computations. We theoretically analyze our approach and present approximation guarantees for the solutions returned by DISCOVER. We also study a natural trade-off between the communication cost and the number of rounds required to obtain such a solution. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including active set selection, exemplar-based clustering, and vertex cover on tens of millions of data points using Spark.
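For reference, here is a sketch of the classical centralized greedy algorithm for submodular cover that DISCOVER distributes: keep adding the element with the largest marginal gain until the required utility Q is reached. f is assumed to be a monotone submodular callable over sets; names are ours.

```python
# A minimal sketch of greedy submodular cover: grow the solution set one
# element at a time, always taking the largest marginal gain, and stop
# once the target utility Q is covered (or no element adds utility).
def greedy_cover(f, ground_set, Q):
    S = set()
    while f(S) < Q:
        remaining = [e for e in ground_set if e not in S]
        if not remaining:
            break  # Q is not achievable with the given ground set
        best = max(remaining, key=lambda e: f(S | {e}) - f(S))
        if f(S | {best}) == f(S):
            break  # no element improves utility; avoid looping forever
        S.add(best)
    return S
```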
2014
KDD
Streaming submodular maximization: Massive data summarization on the fly
How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. I.e., we would like to select a subset of, say, k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with a constant-factor (1/2 − ε) approximation guarantee to the optimum solution, requiring only a single pass through the data and memory independent of the data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
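A single-threshold sketch of the streaming idea follows, assuming a guess of the optimum value is available; the actual algorithm maintains a geometric grid of such thresholds in parallel, so this is an illustrative simplification with names of our choosing.

```python
# A minimal sieve-style streaming sketch: keep an element only if its
# marginal gain clears a threshold derived from a guess of the optimum,
# using a single pass and memory of at most k elements.
def threshold_streaming(f, stream, k, opt_guess):
    S = set()
    for e in stream:
        if len(S) >= k:
            break
        gain = f(S | {e}) - f(S)
        # Standard sieve rule: accept if gain covers the remaining deficit
        # toward opt_guess / 2, spread over the remaining slots.
        if gain >= (opt_guess / 2 - f(S)) / (k - len(S)):
            S.add(e)
    return S
```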
NetSciCom
Modeling the impact of user awareness on immunization strategies
Despite the efforts to design better antivirus software, malware continues to spread and cause enormous damage. Immunizing computer systems, the most effective control policy for preventing such infections, has a two-fold effect. On one hand, it increases the global immunity of the network by providing indirect protection for unimmunized systems. On the other hand, raising users' awareness of the possibility of infection can trigger behavioral changes by which users take measures to reduce their systems' susceptibility using antivirus software. Here, we propose the Behavior-Immunity model, which allows measuring the vaccination effect based on the indirect protective effect of immunization strategies. It also provides a means to utilize human behavioral changes to enhance the effectiveness of immunization strategies. In this work, we focus on word of mouth as the source of user awareness and show that immunization schemes can appropriately utilize the behavioral changes to achieve better results. We also present a methodology for network immunization that is provably close to the optimal solution. Extensive computational experiments on synthetic and real-world networks reveal that this strategy offers a significant improvement over the well-studied targeted immunization method based on degree centrality.
2013
SNAM
Revenue maximization in social networks through discounting
Social networking has become a part of daily life for many individuals across the world. The widespread adoption of such networks can be utilized by business corporations as a powerful means of advertising. In this study, we investigated viral marketing strategies in which buyers are influenced by other buyers who already own an item. Since finding an optimal marketing strategy is NP-hard, a simple strategy has previously been proposed in which giving the item for free to a subset of influential buyers in a network increases the valuation of the other potential buyers for the item. Here, we considered the more general problem of offering discounts instead of giving the item for free to an initial set of buyers. We introduced three approaches for finding an appropriate discount sequence based on the following iterative idea: in each step, we offer the item to the potential buyers at a discounted price such that they all accept the offers and buy the product. Selling the item to the most influential buyers, as the opinion leaders, increases the willingness of other buyers to pay a higher price. Thus, in the following steps, we can offer the item with a lower discount while still guaranteeing the acceptance of the offers. Furthermore, we investigated two marketing strategies based on local search and hill climbing algorithms. Extensive computational experiments on artificially constructed model networks as well as on a number of real-world networks revealed the effectiveness of the proposed discount-based strategies.
Many large-scale machine learning problems (such as clustering, non-parametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable, representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set; but for truly large-scale problems, rendering the data centrally is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol GreeDI, that is easily implemented using MapReduce style computations. We theoretically analyze our approach, and show that, under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference on tens of millions of examples using Hadoop.
In this letter, we studied epidemic spreading on scale-free networks, assuming a limited budget for immunization. We proposed a general model in which the immunity of an individual against the disease depends on their immunized friends in the network. Furthermore, we considered the possibility that each individual might be willing to pay a price to buy the vaccine and become immune against the disease. Under these assumptions, we proposed an algorithm for improving the performance of all previous immunization algorithms. We also introduced a heuristic extension of the algorithm, which works well in scale-free networks.
Many technological networks can experience random and/or systematic failures in their components. More destructive situations can arise if the components have limited capacity, where the failure of one component might lead to a cascade of failures in others and consequently break down the structure of the network. In this paper, the tolerance to cascaded failures was investigated in weighted networks. Three weighting strategies were considered: the betweenness centrality of the edges, the product of the degrees of the end nodes, and the product of their betweenness centralities. The effect of a cascaded attack was then investigated under the local weighted flow redistribution rule, with the capacity of the edges taken to be proportional to their initial weight distribution. The size of the surviving part of the attacked network was determined in model networks as well as in a number of real-world networks, including the power grid, the Internet at the autonomous-system level, the railway network of Europe, and the United States airport network. We found that the networks in which the weight of each edge is the product of the betweenness centralities of its end nodes had the best robustness against cascaded failures.
PLoS
Failure tolerance of motif structure in biological networks
Complex networks serve as generic models for many biological systems that have been shown to share a number of common structural properties, such as power-law degree distributions and small-worldness. Real-world networks are composed of building blocks called motifs, which are specific subgraphs of a (usually) small number of nodes. Network motifs are important to the functionality of complex networks, and the role of some motifs, such as the feed-forward loop, in many biological networks has been heavily studied. On the other hand, many biological networks have shown some degree of robustness in terms of their efficiency and connectedness against failures in their components. In this paper, we investigated how random and systematic failures in the edges of biological networks influenced their motif structure. We considered two biological networks, namely a protein structure network and a human brain functional network. Furthermore, we considered random failures as well as systematic failures based on different strategies for choosing candidate edges for removal. Failures in edges attached to high-degree nodes had the most destructive role in the motif structure of the networks, decreasing their significance level, while removing edges connected to nodes with high values of betweenness centrality had the least effect on the significance profiles; in some cases, the latter even increased the significance levels of the motifs.
ICC
Reuse-Attack Mitigation in Wireless Sensor Networks
Privacy preservation in wireless sensor networks has drawn considerable attention from the research community during the last few years. The emergence of single-owner, multi-user commercial sensor networks, along with the hostile and uncontrollable environments of such networks, makes security in these networks of great importance. This paper concentrates on token-based privacy preservation schemes. A possible attack on such schemes is introduced, and two different approaches are utilized to mitigate the attack. Mathematical models for assessing the attack's effect and overhead are presented, and the results are verified using extensive simulations.
2009
ISPA
Utility proportional optimization flow control for overlay multicast
Deniable authentication allows the receiver to identify the source of the received messages but not to prove it to any third party. However, deniability of the content, which is called restricted deniability in this paper, is a concern in electronic voting and other similar applications. At present, most non-interactive deniable authentication protocols cannot resist the weakened key-compromise impersonation (W-KCI) attack. To address this problem, a non-interactive identity-based restricted deniable authentication protocol is proposed. It not only resists the W-KCI attack but also offers communication flexibility, and it meets security requirements such as correctness and restricted deniability. Therefore, this protocol can be applied to electronic voting.