- Sampling the News Producers: A Large News and Feature Data Set for the Study of the Complex Media Landscape The complexity and diversity of today's media landscape provides many challenges for researchers studying news producers. These producers use many different strategies to get their message believed by readers through the writing styles they employ, by repetition across different media sources with or without attribution, as well as other mechanisms that are yet to be studied deeply. To better facilitate systematic studies in this area, we present a large political news data set, containing over 136K news articles, from 92 news sources, collected over 7 months of 2017. These news sources are carefully chosen to include well-established and mainstream sources, maliciously fake sources, satire sources, and hyper-partisan political blogs. In addition to each article we compute 130 content-based and social media engagement features drawn from a wide range of literature on political bias, persuasion, and misinformation. With the release of the data set, we also provide the source code for feature computation. In this paper, we discuss the first release of the data set and demonstrate 4 use cases of the data and features: news characterization, engagement characterization, news attribution and content copying, and discovering news narratives. 4 authors · Mar 27, 2018
1 Do GPTs Produce Less Literal Translations? Large Language Models (LLMs) such as GPT-3 have emerged as general-purpose language models capable of addressing many natural language generation or understanding tasks. On the task of Machine Translation (MT), multiple works have investigated few-shot prompting mechanisms to elicit better translations from LLMs. However, there has been relatively little investigation on how such translations differ qualitatively from the translations generated by standard Neural Machine Translation (NMT) models. In this work, we investigate these differences in terms of the literalness of translations produced by the two systems. Using literalness measures involving word alignment and monotonicity, we find that translations out of English (E-X) from GPTs tend to be less literal, while exhibiting similar or better scores on MT quality metrics. We demonstrate that this finding is borne out in human evaluations as well. We then show that these differences are especially pronounced when translating sentences that contain idiomatic expressions. 4 authors · May 26, 2023
- Learning by Reconstruction Produces Uninformative Features For Perception Input space reconstruction is an attractive representation learning paradigm. Despite interpretability of the reconstruction and generation, we identify a misalignment between learning by reconstruction, and learning for perception. We show that the former allocates a model's capacity towards a subspace of the data explaining the observed variance--a subspace with uninformative features for the latter. For example, the supervised TinyImagenet task with images projected onto the top subspace explaining 90\% of the pixel variance can be solved with 45\% test accuracy. Using the bottom subspace instead, accounting for only 20\% of the pixel variance, reaches 55\% test accuracy. The features for perception being learned last explains the need for long training time, e.g., with Masked Autoencoders. Learning by denoising is a popular strategy to alleviate that misalignment. We prove that while some noise strategies such as masking are indeed beneficial, others such as additive Gaussian noise are not. Yet, even in the case of masking, we find that the benefits vary as a function of the mask's shape, ratio, and the considered dataset. While tuning the noise strategy without knowledge of the perception task seems challenging, we provide first clues on how to detect if a noise strategy is never beneficial regardless of the perception task. 2 authors · Feb 17, 2024
16 Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections We tackle the problem of generating highly realistic and plausible mirror reflections using diffusion-based generative models. We formulate this problem as an image inpainting task, allowing for more user control over the placement of mirrors during the generation process. To enable this, we create SynMirror, a large-scale dataset of diverse synthetic scenes with objects placed in front of mirrors. SynMirror contains around 198K samples rendered from 66K unique 3D objects, along with their associated depth maps, normal maps and instance-wise segmentation masks, to capture relevant geometric properties of the scene. Using this dataset, we propose a novel depth-conditioned inpainting method called MirrorFusion, which generates high-quality geometrically consistent and photo-realistic mirror reflections given an input image and a mask depicting the mirror region. MirrorFusion outperforms state-of-the-art methods on SynMirror, as demonstrated by extensive quantitative and qualitative analysis. To the best of our knowledge, we are the first to successfully tackle the challenging problem of generating controlled and faithful mirror reflections of an object in a scene using diffusion based models. SynMirror and MirrorFusion open up new avenues for image editing and augmented reality applications for practitioners and researchers alike. 6 authors · Sep 22, 2024 3
1 Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work. 8 authors · Feb 24
1 Moral Mimicry: Large Language Models Produce Moral Rationalizations Tailored to Political Identity Large Language Models (LLMs) have demonstrated impressive capabilities in generating fluent text, as well as tendencies to reproduce undesirable social biases. This study investigates whether LLMs reproduce the moral biases associated with political groups in the United States, an instance of a broader capability herein termed moral mimicry. This hypothesis is explored in the GPT-3/3.5 and OPT families of Transformer-based LLMs. Using tools from Moral Foundations Theory, it is shown that these LLMs are indeed moral mimics. When prompted with a liberal or conservative political identity, the models generate text reflecting corresponding moral biases. This study also explores the relationship between moral mimicry and model size, and similarity between human and LLM moral word use. 1 authors · Sep 24, 2022
- AVIDa-hIL6: A Large-Scale VHH Dataset Produced from an Immunized Alpaca for Predicting Antigen-Antibody Interactions Antibodies have become an important class of therapeutic agents to treat human diseases. To accelerate therapeutic antibody discovery, computational methods, especially machine learning, have attracted considerable interest for predicting specific interactions between antibody candidates and target antigens such as viruses and bacteria. However, the publicly available datasets in existing works have notable limitations, such as small sizes and the lack of non-binding samples and exact amino acid sequences. To overcome these limitations, we have developed AVIDa-hIL6, a large-scale dataset for predicting antigen-antibody interactions in the variable domain of heavy chain of heavy chain antibodies (VHHs), produced from an alpaca immunized with the human interleukin-6 (IL-6) protein, as antigens. By leveraging the simple structure of VHHs, which facilitates identification of full-length amino acid sequences by DNA sequencing technology, AVIDa-hIL6 contains 573,891 antigen-VHH pairs with amino acid sequences. All the antigen-VHH pairs have reliable labels for binding or non-binding, as generated by a novel labeling method. Furthermore, via introduction of artificial mutations, AVIDa-hIL6 contains 30 different mutants in addition to wild-type IL-6 protein. This characteristic provides opportunities to develop machine learning models for predicting changes in antibody binding by antigen mutations. We report experimental benchmark results on AVIDa-hIL6 by using neural network-based baseline models. The results indicate that the existing models have potential, but further research is needed to generalize them to predict effective antibodies against unknown mutants. The dataset is available at https://avida-hil6.cognanous.com. 11 authors · Jun 5, 2023
- A quantum teleportation inspired algorithm produces sentence meaning from word meaning and grammatical structure We discuss an algorithm which produces the meaning of a sentence given meanings of its words, and its resemblance to quantum teleportation. In fact, this protocol was the main source of inspiration for this algorithm which has many applications in the area of Natural Language Processing. 5 authors · May 2, 2013
- Going beyond Compositions, DDPMs Can Produce Zero-Shot Interpolations Denoising Diffusion Probabilistic Models (DDPMs) exhibit remarkable capabilities in image generation, with studies suggesting that they can generalize by composing latent factors learned from the training data. In this work, we go further and study DDPMs trained on strictly separate subsets of the data distribution with large gaps on the support of the latent factors. We show that such a model can effectively generate images in the unexplored, intermediate regions of the distribution. For instance, when trained on clearly smiling and non-smiling faces, we demonstrate a sampling procedure which can generate slightly smiling faces without reference images (zero-shot interpolation). We replicate these findings for other attributes as well as other datasets. Our code is available at https://github.com/jdeschena/ddpm-zero-shot-interpolation. 4 authors · May 29, 2024
- ML4CO: Is GCNN All You Need? Graph Convolutional Neural Networks Produce Strong Baselines For Combinatorial Optimization Problems, If Tuned and Trained Properly, on Appropriate Data The 2021 NeurIPS Machine Learning for Combinatorial Optimization (ML4CO) competition was designed with the goal of improving state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning models. The competition's main scientific question was the following: is machine learning a viable option for improving traditional combinatorial optimization solvers on specific problem distributions, when historical data is available? This was motivated by the fact that in many practical scenarios, the data changes only slightly between the repetitions of a combinatorial optimization problem, and this is an area where machine learning models are particularly powerful at. This paper summarizes the solution and lessons learned by the Huawei EI-OROAS team in the dual task of the competition. The submission of our team achieved the second place in the final ranking, with a very close distance to the first spot. In addition, our solution was ranked first consistently for several weekly leaderboard updates before the final evaluation. We provide insights gained from a large number of experiments, and argue that a simple Graph Convolutional Neural Network (GCNNs) can achieve state-of-the-art results if trained and tuned properly. 2 authors · Dec 22, 2021